* Linux 2.6.34-rc3
@ 2010-03-30 17:50 Linus Torvalds
  2010-03-30 21:16 ` [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) Rafael J. Wysocki
  2010-04-02 17:59 ` Ugly rmap NULL ptr deref oopsie on hibernate (was " Borislav Petkov
  0 siblings, 2 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-03-30 17:50 UTC
  To: Linux Kernel Mailing List


Ok, so -rc2 was messy, no question about it. I'm too much of a softie to 
hold back some people's work, so my hard-line -rc1 didn't work out the way 
I wanted. But _next_ time! For sure this time.

Anyway, from a messy -rc2 we now have a -rc3 that should be in much better 
shape. Regressions fixed, and the ShortLog is short enough to be worth 
posting to lkml (-rc1 never is, and -rc2 seldom is. It's not like -rc2's 
are generally wonderful, this time around wasn't _that_ much different).

One regression fix that is worth pointing out is the EXT3_STATE_NEW 
handling in ext3, because the regression that one fixed was potentially 
quite nasty. It wouldn't cause data corruption, but it _could_ at least 
corrupt security labels.

So if you have SELinux enabled (either in permissive or enforcing mode) 
_and_ you ran 2.6.34-rc[12] _and_ your filesystem is ext3, you should not 
just update, you should make sure your extended attributes are fixed. The 
easiest way to fix them is likely to just check the "Relabel on next boot" 
checkmark in the SELinux config GUI ("system-config-selinux" if you don't 
do that whole admin menu thing), and reboot into 2.6.34-rc3.

[ And you can use something like 'restorecon -rv /' or whatever after 
  booting into the fixed kernel. See your nearest SELinux manual for real 
  details. ]

You might even want to do the whole "touch /forcefsck" before rebooting to 
make sure fsck runs (I don't think it matters, but it won't hurt - the 
relabeling will be so slow that whatever time your fsck takes is totally 
irrelevant, so do them both and get it over with).

Of course, I suspect most people who run experimental kernels are also the 
kinds of people who have turned off SELinux in annoyance long ago, or tend 
to be the kinds of people who long since upgraded to ext4 (which didn't 
have the problem), but hey, what do I know.

In short - if you started seeing odd security messages after running early 
2.6.34 -rc kernels, now you know what was going on.

Other than that? Random fixes and updates all over. Mostly drivers and 
filesystems, and mostly fairly small things. If you had PCI resource 
conflict problems with the early -rc's due to the _CRS window thing, for 
example, that should hopefully be fixed. See the appended shortlog for 
other details.

		Linus

---
Abraham Arce (1):
      KS8851: Avoid NULL pointer in set rx mode

Adel Gadllah (1):
      iwlwifi: Silence tfds_in_queue message

Adrian Hunter (1):
      mmc: fix incorrect interpretation of card type bits

Al Viro (1):
      Restore LOOKUP_DIRECTORY hint handling in final lookup on open()

Alexander Duyck (3):
      igb: only use vlan_gro_receive if vlans are registered
      skbuff: remove unused dma_head & dma_maps fields
      igb: use correct bits to identify if managability is enabled

Alexandra Kossovsky (1):
      tcp: Fix OOB POLLIN avoidance.

Amerigo Wang (1):
      netpoll: warn when there are spaces in parameters

Ameya Palande (1):
      regulator: Get rid of lockdep warning

Amit Kumar Salecha (4):
      netxen: fix bios version calculation
      netxen: fix warning in ioaddr for NX3031 chip
      netxen: added sanity check for pci map
      netxen: update version to 4.0.73

Amit Shah (2):
      virtio: console: Generate a kobject CHANGE event on adding 'name' attribute
      virtio: console: Check if port is valid in resize_console

Andreas Bombe (1):
      sh64: Remove long unused mid_sched macro

Andreas Herrmann (1):
      x86, amd: Restrict usage of c1e_idle()

Andrei Emeltchenko (1):
      Bluetooth: Fix kernel crash on L2CAP stress tests

Andrew Morton (2):
      timer stats: Fix del_timer_sync() and try_to_del_timer_sync()
      kernel/sched.c: Suppress unused var warning

Andy Gospodarek (1):
      bonding: fix broken multicast with round-robin mode

Anton Blanchard (1):
      ppc64 sys_ipc breakage in 2.6.34-rc2

Arnaldo Carvalho de Melo (2):
      perf top: Improve the autosizing of column lenghts
      perf top: Add missing initialization to zero

Axel Lin (2):
      lp3971: Fix setting val for LDO2 and LDO4
      lp3971: Fix BUCK_VOL_CHANGE_SHIFT logic

Ben Blum (1):
      cgroups: net_cls as module

Ben Menchaca (1):
      gianfar: fix undo of reserve()

Benjamin Li (1):
      bnx2: Fix netpoll crash.

Bjorn Helgaas (11):
      resources: add interfaces that return conflict information
      PCI: for address space collisions, show conflicting resource
      PCI: break out primary/secondary/subordinate for readability
      PCI: make disabled window printk style match the enabled ones
      PCI: print resources consistently with %pR
      PCI: complain about devices that seem to be broken
      PCI: don't say we claimed a resource if we failed
      x86/PCI: remove redundant warnings
      frv/PCI: remove redundant warnings
      x86/PCI: for host bridge address space collisions, show conflicting resource
      x86/PCI: truncate _CRS windows with _LEN > _MAX - _MIN + 1

Borislav Petkov (2):
      edac, mce: Filter out invalid values
      fs/binfmt_aout.c: fix pointer warnings

Brandon L Black (1):
      net: Add MSG_WAITFORONE flag to recvmmsg

Carolyn Wyborny (1):
      igb: Add support for 82576 ET2 Quad Port Server Adapter

Cheng Renquan (1):
      ceph: some documentations fixes

Chris Leech (1):
      ixgbe: filter FIP frames into the FCoE offload queues

Chris Wilson (1):
      drm/i915: Avoid NULL deref in get_pages() unwind after error.

Christian Borntraeger (1):
      [S390] system.h: Fix compile error for 1 and 2 byte cmpxchg

Christian Lamparter (2):
      [ARM] Kirkwood: WPS button keycode mapping
      [ARM] Orion5x: replace KEY_WLAN with KEY_WPS_BUTTON

Clemens Ladisch (4):
      firewire: core: fw_iso_resource_manage: fix error handling
      firewire: ohci: add cycle timer quirk for the TI TSB12LV22
      ALSA: cmipci: work around invalid PCM pointer
      PCI quirk: RS780/RS880: work around missing MSI initialization

Colin Ian King (1):
      softlockup: Stop spurious softlockup messages due to overflow

Crane Cai (1):
      i2c-scmi: Support IBM SMBus CMI devices

Daisuke Nishimura (1):
      memcg: disable move charge in no mmu case

Dan Carpenter (11):
      drm/i915: fix small leak on overlay error path
      sunrpc: handle allocation errors from __rpc_lookup_create()
      pxa168fb: fix incorrect resource calculation
      AFS: Potential null dereference
      regulator: handle kcalloc() failure
      ceph: handle kmalloc() failure
      af_key: return error if pfkey_xfrm_policy2msg_prep() fails
      memcontrol: fix potential null deref
      kcore: fix test for end of list
      fscache: add missing unlock
      hwmon: (w83793) Saving negative errors in unsigned

Daniel Chen (1):
      ALSA: ac97: Add Toshiba P500 to ac97 jack sense blacklist

Daniel Mack (2):
      ASoC: pxa-pcm-lib: initialize DMA channel to -1
      [ARM] pxa/raumfeld: fix button name

Daniel T Chen (3):
      ALSA: hda: Fix 0 dB offset for HP laptops using CX20551 (Waikiki)
      ALSA: ac97: Add IBM ThinkPad R40e to Headphone/Line Jack Sense blacklist
      ALSA: hda: Use LPIB for ga-ma770-ud3 board

Daniel Taylor (1):
      fs/partitions/msdos: add support for large disks

Daniel Vetter (1):
      drm/intel: fix up set_tiling for untiled->tiled transition

Darrick J. Wong (2):
      acpi: Support IBM SMBus CMI devices
      i2c-scmi: Provide module aliases for automatic loading

Dave Airlie (1):
      slow-work: use get_ref wrapper instead of directly calling get_ref

David Howells (9):
      nommu: fix an incorrect comment in the do_mmap_shared_file()
      Document Linux's circular buffering capabilities
      FDPIC: For-loop in elf_core_vma_data_size() is incorrect
      do_sync_read/write() should set kiocb.ki_nbytes to be consistent
      NOMMU: Revert 'nommu: get_user_pages(): pin last page on non-page-aligned start'
      NOMMU: Fix __get_user_pages() to pin last page on offset buffers
      SLOW_WORK: CONFIG_SLOW_WORK_PROC should be CONFIG_SLOW_WORK_DEBUG
      frv/chris: fix lines with a missing semicolons
      KEYS: Add MAINTAINERS record

David Härdeman (1):
      kfifo: fix KFIFO_INIT in include/linux/kfifo.h

David S. Miller (7):
      via-velocity: Fix FLOW_CNTL_TX_RX handling in set_mii_flow_control()
      isdn: Add netdev to lists in MAINTAINERS entry.
      Revert "r8169: enable 64-bit DMA by default for PCI Express devices (v2)"
      Revert "via82cxxx: workaround h/w bugs"
      tulip: Add missing parens.
      Revert "ide: skip probe if there are no devices on the port (v2)"
      sparc64: Properly truncate pt_regs framepointer in perf callback.

Dean Nelson (4):
      PCI: fix return value from pcix_get_max_mmrbc()
      PCI: fix access of PCI_X_CMD by pcix get and set mmrbc functions
      PCI: cleanup error return for pcix get and set mmrbc functions
      hwmon: (coretemp) Add missing newline to dev_warn() message

Derek Kelly (1):
      ALSA: hda - Add support of Nvidia GT220 HDMI

Dmitry Torokhov (1):
      Regulators: max8925-regulator - clean up driver data after removal

Dominik Brodowski (4):
      pcmcia: do not use ioports < 0x100 on x86
      pcmcia: allow for four multifunction subdevices (again)
      power: support _noirq actions on device types and classes
      pcmcia: use dev_pm_ops for class pcmcia_socket_class

Emil Tantilov (4):
      igb: do not modify tx_queue_len on link speed change
      igbvf: do not modify tx_queue_len on link speed change
      e1000e: do not modify tx_queue_len on link speed change
      e1000: do not modify tx_queue_len on link speed change

Eric Anholt (6):
      drm/i915: Don't bother with the BKL for GEM ioctls.
      drm/i915: Enable VS timer dispatch.
      agp/intel: Respect the GTT size on Sandybridge for scratch page setup.
      agp/intel: Don't do the chipset flush on Sandybridge.
      drm/i915: Set up the documented clock gating on Sandybridge and Ironlake.
      drm/i915: Stop trying to use ACPI lid status to determine LVDS connection.

Eric Dumazet (3):
      net: Potential null skb->dev dereference
      netfilter: xt_hashlimit: dl_seq_stop() fix
      netfilter: xt_hashlimit: IPV6 bugfix

Eric Miao (3):
      [ARM] mmp: fix for variables in uncompress.h being discarded
      [ARM] pxa: remove unnecessary 'select FB_W100' from some platforms
      [ARM] pxa/sharpsl: add dependency of max1111 driver to sharpsl_pm

Eric Sandeen (1):
      ext4: Fixed inode allocator to correctly track a flex_bg's used_dirs

Eric W. Biederman (1):
      netxen: The driver doesn't work on NX_P3_B1 so cause probe to fail.

FUJITA Tomonori (1):
      Documentation: rename PCI/PCI-DMA-mapping.txt to DMA-API-HOWTO.txt

Felix Fietkau (1):
      ath9k: fix BUG_ON triggered by PAE frames

Francois Romieu (1):
      r8169: fix broken register writes

Grazvydas Ignotas (1):
      wl1251: fix potential crash

Greg Rose (7):
      ixgbevf: Fix VF Stats accounting after reset
      ixgbevf: Shorten up delay timer for watchdog task
      ixgbevf: Message formatting cleanups
      ixgbevf: Fix signed/unsigned int error
      ixgbe: In SR-IOV mode insert delay before bring the adapter up
      ixgbe: Change where clear_to_send_flag is reset to zero.
      ixgbe: Do not run all Diagnostic offline tests when VFs are active

Greg Thelen (1):
      memcg: fix typo in memcg documentation

Guennadi Liakhovetski (3):
      ASoC: SIU driver shall select FW_LOADER
      SH: fix SCIFA SCASCR register bit definitions
      SH: remove superfluous warning from the serial driver

Guenter Roeck (1):
      ipv4: Don't drop redirected route cache entry unless PTMU actually expired

Guo-Fu Tseng (3):
      jme: Fix VLAN memory leak
      jme: Protect vlgrp structure by pause RX actions.
      jme: Advance driver version number

H Hartley Sweeten (2):
      [ARM] locomo: fix SPI register offset
      [ARM] locomo: fix unpaired spin_lock_irqsave

Hans-Joachim Picht (1):
      [S390] fix broken proc interface for sclp_async

Heiko Carstens (2):
      [S390] smp: fix lowcore allocation
      [S390] sclp: avoid 64 bit division

Henrik Kretzschmar (5):
      genirq: Move two IRQ functions from .init.text to .text
      isdn: Cleanup Sections in PCMCIA driver sedlbauer
      isdn: Cleanup Sections in PCMCIA driver teles
      isdn: Cleanup Sections in PCMCIA driver avma1
      isdn: Cleanup Sections in PCMCIA driver elsa

Herbert Xu (1):
      ipv6: Remove redundant dst NULL check in ip6_dst_check

Huang Weiyi (1):
      [ARM] pxa/raumfeld: remove duplicated #include

Ian Campbell (1):
      x86: Do not free zero sized per cpu areas

Jan Beulich (1):
      x86: Fix placement of FIX_OHCI1394_BASE

Jan Kara (2):
      ext4: Fix estimate of # of blocks needed to write indirect-mapped files
      ext4: Don't use delayed allocation by default when used instead of ext3

Jani Nikula (1):
      c2port: fix device_create() return value check

Jarkko Nikula (1):
      ALSA: pcm_lib - fix xrun functionality

Jaswinder Singh Rajput (1):
      hwmon: (asc7621) Add X58 entry in Kconfig

Jeff Dike (1):
      vhost: fix error path in vhost_net_set_backend

Jeff Layton (1):
      NFS: don't try to decode GETATTR if DELEGRETURN returned error

Jeff Mahoney (2):
      reiserfs: fix oops while creating privroot with selinux enabled
      reiserfs: properly honor read-only devices

Jens Rottmann (1):
      ksz884x: fix return value of netdev_set_eeprom

Jiri Kosina (1):
      x86: Remove excessive early_res debug output

Joe Perches (3):
      drivers/gpu/drm/i915/intel_bios.c: fix continuation line formats
      MAINTAINERS: use tab not spaces for delimiter
      drivers/net: Fix continuation lines

Joern Engel (12):
      Open segment file before using it
      Limit max_pages for insane devices
      Plug memory leak in writeseg_end_io
      Prevent schedule while atomic in __logfs_readdir
      Write out both superblocks on mismatch
      Fix logfs_get_sb_final error path
      Use deactivate_locked_super
      Prevent data corruption in logfs_rewrite_block()
      Simplify and fix pad_wbuf
      [LogFS] Clear PagePrivate when moving journal
      [LogFS] Move reserved segments with journal
      [LogFS] Erase new journal segments

John Fastabend (1):
      ixgbe: cleanup maximum number of tx queues

John Stultz (1):
      time: Fix accumulation bug triggered by long delay.

Jon Maloy (1):
      TIPC: Removed inactive maintainer

Jonathan Cameron (2):
      [ARM] pxa: fix for variables in uncompress.h being discarded
      [ARM] pxa: remove spi cs gpio direction to avoid clash with driver

JosephChan@via.com.tw (2):
      pata_via: Add VIA VX900 support
      pata_via: fix VT6410/6415/6330 detection issue

Jozsef Kadlecsik (1):
      netfilter: ip6table_raw: fix table priority

Julia Lawall (2):
      sound/oss/vidc.c: change the field used with DMA_ACTIVE
      arch/sparc/kernel: Use set_cpus_allowed_ptr

KOSAKI Motohiro (6):
      sched: sched_getaffinity(): Allow less than NR_CPUS length
      sched: Use proper type in sched_getaffinity()
      tmpfs: mpol=bind:0 don't cause mount error.
      tmpfs: handle MPOL_LOCAL mount option properly
      tmpfs: cleanup mpol_parse_str()
      doc: add the documentation for mpol=local

Ken Kawasaki (1):
      pcnet_cs: add new id

Komuro (1):
      pd6729: Coding Style fixes

Kunal Gangakhedkar (1):
      ALSA: hda - Add PCI quirk for HP dv6-1110ax.

Kuninori Morimoto (3):
      sh: mach-ecovec24: Add i2c_put_adapter on sh_eth_init
      sh: ms7724: Add tiny-document for sound
      sh: Add watch-dog register address for SH7722/SH7723/SH7724

Kyle McMartin (1):
      tulip: Fix null dereference in uli526x_rx_packet()

Lai Jiangshan (2):
      rcu: Fix tracepoints & lockdep false positive
      rcu: Fix local_irq_disable() CONFIG_PROVE_RCU=y false positives

Lee Schermerhorn (1):
      mempolicy: fix get_mempolicy() for relative and static nodes

Lennart Schulte (1):
      tcp: Fix tcp_mark_head_lost() with packets == 0

Li Zefan (1):
      cgroups: remove duplicate include

Linus Torvalds (3):
      Fix up prototype for sys_ipc breakage
      ext3: fix broken handling of EXT3_STATE_NEW
      Linux 2.6.34-rc3

Magnus Damm (1):
      serial: sh-sci: fix SH-Mobile SH breakage

Mallikarjuna R Chilakala (3):
      ixgbe: Fix 82599 multispeed fiber link issues due to Tx laser flapping
      ixgbe: Fix 82599 KX4 Wake on LAN issue after an improper system shutdown
      ixgbe: Set IXGBE_RSC_CB(skb)->DMA field to zero after unmapping the address

Marcel Holtmann (2):
      Bluetooth: Fix potential bad memory access with sysfs files
      Bluetooth: Convert debug files to actually use debugfs instead of sysfs

Mark Brown (2):
      ASoC: Bail out of wm_hubs DC servo if calibration fails
      ASoC: Remove BROKEN from i.MX audio after dependencies merged

Mark Fasheh (3):
      ocfs2: set i_mode on disk during acl operations
      ocfs2: Always try for maximum bits with new local alloc windows
      ocfs2: Clear undo bits when local alloc is freed

Martin Schwidefsky (1):
      [S390] fix boot failures with compressed kernels

Masami Hiramatsu (4):
      perf probe: Fix probe_point buffer overrun
      perf probe: Fix need_dwarf flag if lazy matching is used
      perf probe: Fix offset to allow signed value
      perf probe: Use original address instead of CU-based address

Mathieu Desnoyers (1):
      CRED: Fix memory leak in error handling

Matt Fleming (3):
      sh: Flush ITLB too in PTEAEX's flush_tlb_page()
      sh: Replace unsafe manipulation of MMUCR
      sh: Fix build after dynamic PMB rework

Matthew Wilcox (1):
      PCI quirk: Disable MSI on VIA K8T890 systems

Miao Xie (2):
      cpuset: fix the problem that cpuset_mem_spread_node() returns an offline node
      cpuset: alloc nodemask_t on the heap rather than the stack

Michael Chan (1):
      bnx2: Use proper handler during netpoll.

Michael Grzeschik (1):
      lxfb: set the H- and V-SYNC polarity of the flatpanel output

Michael Holzheu (1):
      [S390] zcore: CPU registers are not saved under LPAR

Michael S. Tsirkin (3):
      vhost: fix interrupt mitigation with raw sockets
      vhost: fix error handling in vring ioctls
      exit: fix oops in sync_mm_rss

Mike Frysinger (2):
      can: bfin_can: switch to common Blackfin can header
      blackfin: enable DEBUG_SECTION_MISMATCH

Mitch Williams (1):
      igb: count Rx FIFO errors correctly

Neil Horman (1):
      r8169: offical fix for CVE-2009-4537 (overlength frame DMAs)

Nick Bowler (1):
      Staging: et131x: Properly disable FC in txmac.

Nicolas Dichtel (1):
      net: ipmr/ip6mr: prevent out-of-bounds vif_table access

OGAWA Hirofumi (1):
      fs/partition/msdos: fix unusable extended partition for > 512B sector

Owain G. Ainsworth (1):
      drm/i915: remove an unnecessary wait_request()

Pablo Neira Ayuso (3):
      netlink: fix unaligned access in nla_get_be64()
      netlink: fix NETLINK_RECV_NO_ENOBUFS in netlink_set_err()
      netfilter: ctnetlink: fix reliable event delivery if message building fails

Patrick McHardy (3):
      net: ipmr/ip6mr: fix potential out-of-bounds vif_table access
      netfilter: xt_recent: fix regression in rules using a zero hit_count
      net: fix netlink address dumping in IPv4/IPv6

Paul E. McKenney (2):
      rcu: Make rcu_read_lock_bh_held() allow for disabled BH
      net: suppress lockdep-RCU false positive in FIB trie.

Paul Mackerras (1):
      powerpc/perf_events: Fix call-graph recording, add perf_arch_fetch_caller_regs

Paul Mundt (3):
      PCI: kill off pci_register_set_vga_state() symbol export.
      sh: Tidy up a couple of section mismatches.
      sh: Silence unintialized variable warnings in dwarf unwinder.

Paulius Zaleckas (1):
      if_tunnel.h: add missing ams/byteorder.h include

Pavel Emelyanov (2):
      ipv4: Cleanup struct net dereference in rt_intern_hash
      ipv4: Restart rt_intern_hash after emergency rebuild (v2)

Peter Ujfalusi (2):
      ASoC: tlv320dac33: Fix DSP modes
      ASoC: tlv320dac33: Internal clocking changes

Prarit Bhargava (1):
      hwmon: (coretemp) Fix cpu model output

Priit Laes (1):
      drm/i915: Rename FBC_C3_IDLE to FBC_CTL_C3_IDLE to match other registers

Rafael J. Wysocki (1):
      x86 / perf: Fix suspend to RAM on HP nx6325

Randy Dunlap (2):
      scripts/kernel-doc: handle struct member __aligned
      scripts/kernel-doc: fix fatal error on function prototype

Ravikiran G Thirumalai (1):
      tmpfs: fix oops on mounts with mpol=default

Richard Röjfors (1):
      drivers/gpio/max730x.c: add license macro

Rob Landley (1):
      sparc: Fix use of uid16_t and gid16_t in asm/stat.h

Robert Love (2):
      ixgbe: Don't allow user buffer count to exceed 256
      ixgbe: Priority tag FIP frames

Robin Holt (1):
      mm/ksm.c is doing an unneeded _notify in write_protect_page.

Russell King (3):
      ARM: Fix IXP23xx build error in mach/memory.h
      ARM: Update mach-types
      Documentation/volatile-considered-harmful.txt: correct cpu_relax() documentation

Ryusuke Konishi (3):
      nilfs2: fix duplicate call to nilfs_segctor_cancel_freev
      nilfs2: fix hang-up of cleaner after log writer returned with error
      nilfs2: fix imperfect completion wait in nilfs_wait_on_logs

Sachin Prabhu (1):
      Skip check for mandatory locks when unlocking

Sage Weil (26):
      ceph: implemented caps should always be superset of issued caps
      ceph: add missing locking to protect i_snap_realm_item during split
      ceph: fix inode removal from snap realm when racing with migration
      ceph: fix authenticator timeout
      ceph: fix authenticator buffer size calculation
      ceph: release old ticket_blob buffer
      ceph: clean up service ticket decoding
      ceph: fix null pointer deref of r_osd in debug output
      ceph: drop unnecessary WARN_ON in caps migration
      ceph: fix session locking in handle_caps, ceph_check_caps
      ceph: clean up handle_cap_grant, handle_caps wrt session mutex
      ceph: only release unused caps with mds requests
      ceph: fix mds sync() race with completing requests
      ceph: fix pg pool decoding from incremental osdmap update
      ceph: prevent dup stale messages to console for restarting mds
      ceph: fix connection fault con_work reentrancy problem
      ceph: rename r_sent_stamp r_stamp
      ceph: avoid reopening osd connections when address hasn't changed
      ceph: fix snap rebuild condition
      ceph: make write_begin wait propagate ERESTARTSYS
      ceph: propagate mds session allocation failures to caller
      ceph: fix session check on mds reply
      ceph: fix possible double-free of mds request reference
      ceph: avoid loaded term 'OSD' in documention
      ceph: fix use after free on mds __unregister_request
      ceph: update discussion list address in MAINTAINERS

Srinivas Eeda (1):
      ocfs2: Fix a race in o2dlm lockres mastery

Stanislaw Gruszka (1):
      posix-cpu-timers: Reset expire cache when no timer is running

Stefan Haberland (1):
      [S390] dasd: check tsb validity

Stefan Richter (2):
      firewire: core: fix Model_ID in modalias
      firewire: core: align driver match with modalias

Stefan Weinhuber (1):
      [S390] dasd: fix alignment of transport mode recovery TCW

Steve Glendinning (1):
      smsc95xx: Fix tx checksum offload for small packets

Steven J. Magnani (1):
      NET_DMA: free skbs periodically

Steven Rostedt (1):
      ring-buffer: Do 8 byte alignment for 64 bit that can not handle 4 byte align

Suresh Siddha (1):
      x86: Handle legacy PIC interrupts on all the cpu's

Takashi Iwai (3):
      ALSA: hda - Sort codec entry list of Nvidia HDMI
      ALSA: hda - Fix access-after-free in patch_realtek.c
      ALSA: hda - Don't set invalid connection index in Realtek initialiaiton

Tao Ma (4):
      ocfs2: Change bg_chain check for ocfs2_validate_gd_parent.
      ocfs2: Update i_blocks in reflink operations.
      ocfs2: Fix the update of name_offset when removing xattrs
      ocfs2: Init meta_ac properly in ocfs2_create_empty_xattr_block.

Tejun Heo (1):
      libata-sff: fix spurious IRQ handling

Tetsuo Handa (2):
      rxrpc: Check allocation failure.
      rxrpc: Check allocation failure.

Theodore Ts'o (1):
      ext4: Fix spelling of CONTIG_FS_EXT3 to CONFIG_FS_EXT3

Thomas Gleixner (3):
      genirq: Prevent oneshot irq thread race
      clockevents: Sanitize min_delta_ns adjustment and prevent overflows
      genirq: Protect access to irq_desc->action in can_request_irq()

Thomas Weber (1):
      OMAP: DSS2: VRAM: Fix early_param for vram

Tim Yamin (1):
      PCI quirk: only apply CX700 PCI bus parking quirk if external VT6212L is present

Timo Teräs (2):
      ipv4: check rt_genid in dst_check
      ip_gre: include route header_len in max_headroom calculation

Tomi Valkeinen (2):
      OMAP: DSS2: initialize dss clk sources properly
      OMAP: DSS2: panel-generic: re-implement mode changing

Tristan Ye (2):
      Ocfs2: Journaling i_flags and i_orphaned_slot when adding inode to orphan dir.
      Ocfs2: Handle deletion of reflinked oprhan inodes correctly.

Trond Myklebust (4):
      NFS: Prevent another deadlock in nfs_release_page()
      SUNRPC: Fix a potential memory leak in auth_gss
      SUNRPC: Fix a use after free bug with the NFSv4.1 backchannel
      SUNRPC: Fix the return value of rpc_run_bc_task()

Uwe Kleine-König (1):
      rtc/mc13783: fix use after free bug

Vasu Dev (3):
      ixgbe: fix for real_num_tx_queues update issue
      vlan: adds vlan_dev_select_queue
      vlan: updates vlan real_num_tx_queues

Wolfram Sang (2):
      regulator: fix dangling pointers
      get_maintainer: repair STDIN usage

YOSHIFUJI Hideaki / 吉藤英明 (1):
      ipv6: Don't drop cache route entry unless timer actually expired.

Yegor Yefremov (1):
      KS8695: update ksp->next_rx_desc_read at the end of rx loop

Yinghai Lu (2):
      x86: Make smp_locks end with page alignment
      x86: Make sure free_init_pages() frees pages on page boundary

Zhenyu Wang (1):
      drm/i915: Fix check with IS_GEN6

stephen hemminger (1):
      TCP: check min TTL on received ICMP packets

wzt wzt (1):
      benet: Fix compile warnnings in drivers/net/benet/be_ethtool.c



* [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3)
  2010-03-30 17:50 Linux 2.6.34-rc3 Linus Torvalds
@ 2010-03-30 21:16 ` Rafael J. Wysocki
  2010-03-31 20:34   ` [stable] " Greg KH
  2010-04-01  1:13   ` Rafael J. Wysocki
  2010-04-02 17:59 ` Ugly rmap NULL ptr deref oopsie on hibernate (was " Borislav Petkov
  1 sibling, 2 replies; 242+ messages in thread
From: Rafael J. Wysocki @ 2010-03-30 21:16 UTC
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Dave Airlie, dri-devel, Jesse Barnes,
	Linux PCI, Clemens Ladisch, Alex Deucher, stable, Greg KH

On Tuesday 30 March 2010, Linus Torvalds wrote:
...
> Other than that? Random fixes and updates all over. Mostly drivers and 
> filesystems, and mostly fairly small things. If you had PCI resource 
> conflict problems with the early -rc's due to the _CRS window thing, for 
> example, that should hopefully be fixed. See the appended shortlog for 
> other details.

...

> Clemens Ladisch (4):
>       firewire: core: fw_iso_resource_manage: fix error handling
>       firewire: ohci: add cycle timer quirk for the TI TSB12LV22
>       ALSA: cmipci: work around invalid PCM pointer
>       PCI quirk: RS780/RS880: work around missing MSI initialization

This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box
which happens to have a RS780.

The symptom is that every operation involving the GPU is _very_ slow, so the
window manager eventually disables compositing.  Reverting this commit makes
things work flawlessly again.

So, please revert.

BTW, I don't think it's a -stable material.

Thanks,
Rafael


* Re: [stable] [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3)
  2010-03-30 21:16 ` [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) Rafael J. Wysocki
@ 2010-03-31 20:34   ` Greg KH
  2010-04-01  1:13   ` Rafael J. Wysocki
  1 sibling, 0 replies; 242+ messages in thread
From: Greg KH @ 2010-03-31 20:34 UTC
  To: Rafael J. Wysocki
  Cc: Linus Torvalds, Linux PCI, Greg KH, Clemens Ladisch,
	Linux Kernel Mailing List, Jesse Barnes, Alex Deucher, dri-devel,
	Dave Airlie, stable

On Tue, Mar 30, 2010 at 11:16:45PM +0200, Rafael J. Wysocki wrote:
> On Tuesday 30 March 2010, Linus Torvalds wrote:
> ...
> > Other than that? Random fixes and updates all over. Mostly drivers and 
> > filesystems, and mostly fairly small things. If you had PCI resource 
> > conflict problems with the early -rc's due to the _CRS window thing, for 
> > example, that should hopefully be fixed. See the appended shortlog for 
> > other details.
> 
> ...
> 
> > Clemens Ladisch (4):
> >       firewire: core: fw_iso_resource_manage: fix error handling
> >       firewire: ohci: add cycle timer quirk for the TI TSB12LV22
> >       ALSA: cmipci: work around invalid PCM pointer
> >       PCI quirk: RS780/RS880: work around missing MSI initialization
> 
> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box
> which happens to have a RS780.
> 
> The symptom is that every operation involving the GPU is _very_ slow, so the
> window manager eventually disables compositing.  Reverting this commit makes
> things work flawlessly again.
> 
> So, please revert.
> 
> BTW, I don't think it's a -stable material.

Ok, I'll go drop it.

thanks,

greg k-h


* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3)
  2010-03-30 21:16 ` [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) Rafael J. Wysocki
  2010-03-31 20:34   ` [stable] " Greg KH
@ 2010-04-01  1:13   ` Rafael J. Wysocki
  2010-04-01  2:19       ` Alex Deucher
  2010-04-01 16:29     ` Linus Torvalds
  1 sibling, 2 replies; 242+ messages in thread
From: Rafael J. Wysocki @ 2010-04-01  1:13 UTC
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Dave Airlie, dri-devel, Jesse Barnes,
	Linux PCI, Clemens Ladisch, Alex Deucher, stable, Greg KH

On Tuesday 30 March 2010, Rafael J. Wysocki wrote:
> On Tuesday 30 March 2010, Linus Torvalds wrote:
> ...
> > Other than that? Random fixes and updates all over. Mostly drivers and 
> > filesystems, and mostly fairly small things. If you had PCI resource 
> > conflict problems with the early -rc's due to the _CRS window thing, for 
> > example, that should hopefully be fixed. See the appended shortlog for 
> > other details.
> 
> ...
> 
> > Clemens Ladisch (4):
> >       firewire: core: fw_iso_resource_manage: fix error handling
> >       firewire: ohci: add cycle timer quirk for the TI TSB12LV22
> >       ALSA: cmipci: work around invalid PCM pointer
> >       PCI quirk: RS780/RS880: work around missing MSI initialization
> 
> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box
> which happens to have a RS780.
> 
> The symptom is that every operation involving the GPU is _very_ slow, so the
> window manager eventually disables compositing.  Reverting this commit makes
> things work flawlessly again.
> 
> So, please revert.
> 
> BTW, I don't think it's a -stable material.

OK, I've verified that partial revert (below) is sufficient.

Rafael

---
From: Rafael J. Wysocki <rjw@sisk.pl>
Subject: DRM / radeon: Really do not try to enable MSI on RS780 and RS880

Commit a5ee4eb75413c145334c30e43f1af9875dad6fd7
(PCI quirk: RS780/RS880: work around missing MSI initialization)
removed a quirk to disable MSI on RS780 and RS880, which still is
necessary on my Acer Ferrari One, because pci_enable_msi() attempts
to enable the MSI and apparently succeeds despite the PCI quirk
added by that commit.  Add the removed radeon quirk again.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 drivers/gpu/drm/radeon/radeon_irq_kms.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Index: linux-2.6/drivers/gpu/drm/radeon/radeon_irq_kms.c
===================================================================
--- linux-2.6.orig/drivers/gpu/drm/radeon/radeon_irq_kms.c
+++ linux-2.6/drivers/gpu/drm/radeon/radeon_irq_kms.c
@@ -116,7 +116,13 @@ int radeon_irq_kms_init(struct radeon_de
 	}
 	/* enable msi */
 	rdev->msi_enabled = 0;
-	if (rdev->family >= CHIP_RV380) {
+       /* MSIs don't seem to work on my rs780;
+        * not sure about rs880 or other rs780s.
+        * Needs more investigation.
+        */
+       if ((rdev->family >= CHIP_RV380) &&
+           (rdev->family != CHIP_RS780) &&
+           (rdev->family != CHIP_RS880)) {
 		int ret = pci_enable_msi(rdev->pdev);
 		if (!ret) {
 			rdev->msi_enabled = 1;

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780  (was: Re: Linux 2.6.34-rc3)
  2010-04-01  1:13   ` Rafael J. Wysocki
@ 2010-04-01  2:19       ` Alex Deucher
  2010-04-01 16:29     ` Linus Torvalds
  1 sibling, 0 replies; 242+ messages in thread
From: Alex Deucher @ 2010-04-01  2:19 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linus Torvalds, Linux PCI, Greg KH, Clemens Ladisch,
	Linux Kernel Mailing List, Jesse Barnes, Alex Deucher, dri-devel,
	stable, Dave Airlie

[-- Attachment #1: Type: text/plain, Size: 2991 bytes --]

On Wed, Mar 31, 2010 at 9:13 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> On Tuesday 30 March 2010, Rafael J. Wysocki wrote:
>> On Tuesday 30 March 2010, Linus Torvalds wrote:
>> ...
>> > Other than that? Random fixes and updates all over. Mostly drivers and
>> > filesystems, and mostly fairly small things. If you had PCI resource
>> > conflict problems with the early -rc's due to the _CRS window thing, for
>> > example, that should hopefully be fixed. See the appended shortlog for
>> > other details.
>>
>> ...
>>
>> > Clemens Ladisch (4):
>> >       firewire: core: fw_iso_resource_manage: fix error handling
>> >       firewire: ohci: add cycle timer quirk for the TI TSB12LV22
>> >       ALSA: cmipci: work around invalid PCM pointer
>> >       PCI quirk: RS780/RS880: work around missing MSI initialization
>>
>> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box
>> which happens to have a RS780.
>>
>> The symptom is that every operation involving the GPU is _very_ slow, so the
>> window manager eventually disables compositing.  Reverting this commit makes
>> things work flawlessly again.
>>
>> So, please revert.
>>
>> BTW, I don't think it's a -stable material.
>
> OK, I've verified that partial revert (below) is sufficient.
>
> Rafael
>
> ---
> From: Rafael J. Wysocki <rjw@sisk.pl>
> Subject: DRM / radeon: Really do not try to enable MSI on RS780 and RS880
>
> Commit a5ee4eb75413c145334c30e43f1af9875dad6fd7
> (PCI quirk: RS780/RS880: work around missing MSI initialization)
> removed a quirk to disable MSI on RS780 and RS880, which still is
> necessary on my Acer Ferrari One, because pci_enable_msi() attempts
> to enable the MSI and apparently succeeds despite the PCI quirk
> added by that commit.  Add the removed radeon quirk again.
>
> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
> ---
>  drivers/gpu/drm/radeon/radeon_irq_kms.c |    8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/drivers/gpu/drm/radeon/radeon_irq_kms.c
> ===================================================================
> --- linux-2.6.orig/drivers/gpu/drm/radeon/radeon_irq_kms.c
> +++ linux-2.6/drivers/gpu/drm/radeon/radeon_irq_kms.c
> @@ -116,7 +116,13 @@ int radeon_irq_kms_init(struct radeon_de
>        }
>        /* enable msi */
>        rdev->msi_enabled = 0;
> -       if (rdev->family >= CHIP_RV380) {
> +       /* MSIs don't seem to work on my rs780;
> +        * not sure about rs880 or other rs780s.
> +        * Needs more investigation.
> +        */
> +       if ((rdev->family >= CHIP_RV380) &&
> +           (rdev->family != CHIP_RS780) &&
> +           (rdev->family != CHIP_RS880)) {
>                int ret = pci_enable_msi(rdev->pdev);
>                if (!ret) {
>                        rdev->msi_enabled = 1;

I also have the attached patch queued in via Dave's tree to disable
MSI on all IGP chips for the time being.

Alex

[-- Attachment #2: 0001-drm-radeon-kms-disable-MSI-on-IGP-chips.patch --]
[-- Type: application/mbox, Size: 1248 bytes --]

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3)
  2010-04-01  2:19       ` Alex Deucher
  (?)
@ 2010-04-01  6:36       ` Clemens Ladisch
  2010-04-01 15:01           ` Alex Deucher
  -1 siblings, 1 reply; 242+ messages in thread
From: Clemens Ladisch @ 2010-04-01  6:36 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Rafael J. Wysocki, Linus Torvalds, Linux PCI, Greg KH,
	Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable,
	Dave Airlie

Alex Deucher wrote:
> On Wed, Mar 31, 2010 at 9:13 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
>> On Tuesday 30 March 2010, Rafael J. Wysocki wrote:
>>> >       PCI quirk: RS780/RS880: work around missing MSI initialization
>>>
>>> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box
>>> which happens to have a RS780.

So it's better to disable MSI unconditionally.

Rafael, can you check if MSI works for the HDMI audio device?
(I'd guess it doesn't.)

> I also have the attached patch queued in via Dave's tree to disable
> MSI on all IGP chips for the time being.

This disables MSI only for the graphics device.  I'd prefer to have
the quirk on its bridge so that MSI gets disabled for the HDMI audio
device too, to avoid having to duplicate this quirk in the snd-hda-intel
driver.

==========

PCI quirk: RS780/RS880: disable MSI completely

The missing initialization of the nb_cntl.strap_msi_enable does not seem
to be the only problem that prevents MSI, so that quirk is not
sufficient to enable MSI on all machines.  To be safe, unconditionally
disable MSI for the internal graphics and HDMI audio on these chipsets.

Signed-off-by: Clemens Ladisch <clemens@ladisch.de>

--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -2123,6 +2123,8 @@ static void __devinit quirk_disable_msi(struct pci_dev *dev)
 	}
 }
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_8131_BRIDGE, quirk_disable_msi);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9602, quirk_disable_msi);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, 0x9602, quirk_disable_msi);
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_VIA, 0xa238, quirk_disable_msi);
 
 /* Go through the list of Hypertransport capabilities and
@@ -2495,39 +2497,6 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x4374,
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x4375,
 			quirk_msi_intx_disable_bug);
 
-/*
- * MSI does not work with the AMD RS780/RS880 internal graphics and HDMI audio
- * devices unless the BIOS has initialized the nb_cntl.strap_msi_enable bit.
- */
-static void __init rs780_int_gfx_disable_msi(struct pci_dev *int_gfx_bridge)
-{
-	u32 nb_cntl;
-
-	if (!int_gfx_bridge->subordinate)
-		return;
-
-	pci_bus_write_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
-				   0x60, 0);
-	pci_bus_read_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
-				  0x64, &nb_cntl);
-
-	if (!(nb_cntl & BIT(10))) {
-		dev_warn(&int_gfx_bridge->dev,
-			 FW_WARN "RS780: MSI for internal graphics disabled\n");
-		int_gfx_bridge->subordinate->bus_flags |= PCI_BUS_FLAGS_NO_MSI;
-	}
-}
-
-#define PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX	0x9602
-
-DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD,
-			PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX,
-			rs780_int_gfx_disable_msi);
-/* wrong vendor ID on M4A785TD motherboard: */
-DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK,
-			PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX,
-			rs780_int_gfx_disable_msi);
-
 #endif /* CONFIG_PCI_MSI */
 
 #ifdef CONFIG_PCI_IOV

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780  (was: Re: Linux 2.6.34-rc3)
  2010-04-01  6:36       ` Clemens Ladisch
@ 2010-04-01 15:01           ` Alex Deucher
  0 siblings, 0 replies; 242+ messages in thread
From: Alex Deucher @ 2010-04-01 15:01 UTC (permalink / raw)
  To: Clemens Ladisch
  Cc: Rafael J. Wysocki, Linus Torvalds, Linux PCI, Greg KH,
	Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable,
	Dave Airlie

On Thu, Apr 1, 2010 at 2:36 AM, Clemens Ladisch <clemens@ladisch.de> wrote:
> Alex Deucher wrote:
>> On Wed, Mar 31, 2010 at 9:13 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
>>> On Tuesday 30 March 2010, Rafael J. Wysocki wrote:
>>>> >       PCI quirk: RS780/RS880: work around missing MSI initialization
>>>>
>>>> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box
>>>> which happens to have a RS780.
>
> So it's better to disable MSI unconditionally.
>
> Rafael, can you check if MSI works for the HDMI audio device?
> (I'd guess it doesn't.)
>
>> I also have the attached patch queued in via Dave's tree to disable
>> MSI on all IGP chips for the time being.
>
> This disables MSI only for the graphics device.  I'd prefer to have
> the quirk on its bridge so that MSI gets disabled for the HDMI audio
> device too, to avoid having to duplicate this quirk in the snd-hda-intel
> driver.
>
> ==========
>
> PCI quirk: RS780/RS880: disable MSI completely
>
> The missing initialization of the nb_cntl.strap_msi_enable does not seem
> to be the only problem that prevents MSI, so that quirk is not
> sufficient to enable MSI on all machines.  To be safe, unconditionally
> disable MSI for the internal graphics and HDMI audio on these chipsets.
>
> Signed-off-by: Clemens Ladisch <clemens@ladisch.de>

Works fine here.

Tested-by: Alex Deucher <alexdeucher@gmail.com>

>
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -2123,6 +2123,8 @@ static void __devinit quirk_disable_msi(struct pci_dev *dev)
>        }
>  }
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_8131_BRIDGE, quirk_disable_msi);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9602, quirk_disable_msi);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, 0x9602, quirk_disable_msi);
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_VIA, 0xa238, quirk_disable_msi);
>
>  /* Go through the list of Hypertransport capabilities and
> @@ -2495,39 +2497,6 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x4374,
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x4375,
>                        quirk_msi_intx_disable_bug);
>
> -/*
> - * MSI does not work with the AMD RS780/RS880 internal graphics and HDMI audio
> - * devices unless the BIOS has initialized the nb_cntl.strap_msi_enable bit.
> - */
> -static void __init rs780_int_gfx_disable_msi(struct pci_dev *int_gfx_bridge)
> -{
> -       u32 nb_cntl;
> -
> -       if (!int_gfx_bridge->subordinate)
> -               return;
> -
> -       pci_bus_write_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
> -                                  0x60, 0);
> -       pci_bus_read_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
> -                                 0x64, &nb_cntl);
> -
> -       if (!(nb_cntl & BIT(10))) {
> -               dev_warn(&int_gfx_bridge->dev,
> -                        FW_WARN "RS780: MSI for internal graphics disabled\n");
> -               int_gfx_bridge->subordinate->bus_flags |= PCI_BUS_FLAGS_NO_MSI;
> -       }
> -}
> -
> -#define PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX    0x9602
> -
> -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD,
> -                       PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX,
> -                       rs780_int_gfx_disable_msi);
> -/* wrong vendor ID on M4A785TD motherboard: */
> -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK,
> -                       PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX,
> -                       rs780_int_gfx_disable_msi);
> -
>  #endif /* CONFIG_PCI_MSI */
>
>  #ifdef CONFIG_PCI_IOV
>

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3)
  2010-04-01  1:13   ` Rafael J. Wysocki
  2010-04-01  2:19       ` Alex Deucher
@ 2010-04-01 16:29     ` Linus Torvalds
  2010-04-01 17:07         ` Alex Deucher
                         ` (2 more replies)
  1 sibling, 3 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-01 16:29 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux Kernel Mailing List, Dave Airlie, dri-devel, Jesse Barnes,
	Linux PCI, Clemens Ladisch, Alex Deucher, stable, Greg KH



On Thu, 1 Apr 2010, Rafael J. Wysocki wrote:
> 
> OK, I've verified that partial revert (below) is sufficient.

Hmm. Through the DRM merge I just did, this area actually conflicted, and 
the resolved version is now

        if ((rdev->family >= CHIP_RV380) &&
            (!(rdev->flags & RADEON_IS_IGP))) {

which presumably also fixes your issue?
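
[ Editorial sketch, not part of the original mail: the resolved condition
  above can be modelled in plain userspace C roughly as follows. The enum
  values, the bit position of RADEON_IS_IGP, and the names fake_rdev and
  should_try_msi() are assumptions made for illustration; only the shape
  of the test is taken from the thread. ]

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins only: the real radeon enum has many more
 * entries and different numeric values. */
enum chip_family {
	CHIP_R300 = 1,
	CHIP_RV380 = 2,
	CHIP_RS780 = 3,
	CHIP_RV770 = 4
};

#define RADEON_IS_IGP (1u << 0)	/* assumed bit position */

/* Hypothetical reduced device structure for this sketch. */
struct fake_rdev {
	enum chip_family family;
	unsigned int flags;
};

/* Same predicate as the merged code: new enough family, and not an IGP. */
bool should_try_msi(const struct fake_rdev *rdev)
{
	return rdev->family >= CHIP_RV380 && !(rdev->flags & RADEON_IS_IGP);
}

/* Small self-check a caller could invoke from main(). */
void should_try_msi_self_check(void)
{
	struct fake_rdev igp = { CHIP_RS780, RADEON_IS_IGP };
	struct fake_rdev discrete = { CHIP_RV770, 0 };

	assert(!should_try_msi(&igp));
	assert(should_try_msi(&discrete));
}
```

Under these assumptions an RS780 carrying the IGP flag never reaches the
MSI enable call, which is the same effect the explicit RS780/RS880 family
checks in the partial revert would have had.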

[ Side note: somebody in the DRM tree seems to be way too used to LISP, 
  and thinks that adding parentheses always improves the code ;-]

However, I do suspect that we should probably revert the quirk regardless 
as being useless (ie it probably was related to those IGP chips that 
apparently don't do MSI anyway).

So the patch that reverts the quirk by Clemens (to replace it with 
disabling MSI entirely when the AMD NB doesn't accept them) seems to be a 
good idea regardless, since it's apparently not just about gfx. Jesse?

			Linus

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780  (was: Re: Linux 2.6.34-rc3)
  2010-04-01 16:29     ` Linus Torvalds
@ 2010-04-01 17:07         ` Alex Deucher
  2010-04-01 19:46       ` Rafael J. Wysocki
  2010-04-01 22:48       ` Jesse Barnes
  2 siblings, 0 replies; 242+ messages in thread
From: Alex Deucher @ 2010-04-01 17:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rafael J. Wysocki, Linux PCI, Greg KH, Clemens Ladisch,
	Linux Kernel Mailing List, Jesse Barnes, Alex Deucher, dri-devel,
	stable

On Thu, Apr 1, 2010 at 12:29 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>
> On Thu, 1 Apr 2010, Rafael J. Wysocki wrote:
>>
>> OK, I've verified that partial revert (below) is sufficient.
>
> Hmm. Through the DRM merge I just did, this area actually conflicted, and
> the resolved version is now
>
>        if ((rdev->family >= CHIP_RV380) &&
>            (!(rdev->flags & RADEON_IS_IGP))) {
>
> which presumably also fixes your issue?
>
> [ Side note: somebody in the DRM tree seems to be way too used to LISP,
>  and thinks that adding parentheses always improves the code ;-]
>

heh, that's me.  habit I guess, just to be sure.

> However, I do suspect that we should probably revert the quirk regardless
> as being useless (ie it probably was related to those IGP chips that
> apparently don't do MSI anyway).
>
> So the patch that reverts the quirk by Clemens (to replace it with
> disabling MSI entirely when the AMD NB doesn't accept them) seems to be a
> good idea regardless, since it's apparently not just about gfx. Jesse?

Clemens' "PCI quirk: RS780/RS880: disable MSI completely" patch is the
right approach I think.  Note that it's only devices hung off the int
gfx pci to pci bridge that have broken MSI (gfx and audio).  MSI works
fine on the PCIE slots.  I have a similar patch for rs400 chips on bug
15626:
https://bugzilla.kernel.org/show_bug.cgi?id=15626

Alex

>
>                        Linus
>

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3)
  2010-04-01 17:07         ` Alex Deucher
  (?)
@ 2010-04-01 17:24         ` Linus Torvalds
  2010-04-01 17:50           ` [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 Clemens Ladisch
  2010-04-01 17:53             ` Alex Deucher
  -1 siblings, 2 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-01 17:24 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Rafael J. Wysocki, Linux PCI, Greg KH, Clemens Ladisch,
	Linux Kernel Mailing List, Jesse Barnes, Alex Deucher, dri-devel,
	stable



On Thu, 1 Apr 2010, Alex Deucher wrote:
> 
> Clemens' "PCI quirk: RS780/RS880: disable MSI completely" patch is the
> right approach I think.  Note that it's only devices hung off the int
> gfx pci to pci bridge that have broken MSI (gfx and audio).  MSI works
> fine on the PCIE slots.  I have a similar patch for rs400 chips on bug
> 15626:
> https://bugzilla.kernel.org/show_bug.cgi?id=15626

Hmm. Does 'pci_msi_enable' only cover regular PCI devices? Or will that 
pci_no_msi() quirk disable MSI for PCIE too? I think it will trigger for 
PCIE drivers too.

Put another way: it sounds like the quirk now disables MSI for all 
devices. Maybe there would be some more targeted mode?

		Linus

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780
  2010-04-01 17:24         ` Linus Torvalds
@ 2010-04-01 17:50           ` Clemens Ladisch
  2010-04-01 17:53             ` Alex Deucher
  1 sibling, 0 replies; 242+ messages in thread
From: Clemens Ladisch @ 2010-04-01 17:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alex Deucher, Rafael J. Wysocki, Linux PCI, Greg KH,
	Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable

Linus Torvalds wrote:
> On Thu, 1 Apr 2010, Alex Deucher wrote:
> > Clemens' "PCI quirk: RS780/RS880: disable MSI completely" patch is the
> > right approach I think.  Note that it's only devices hung off the int
> > gfx pci to pci bridge that have broken MSI (gfx and audio).  MSI works
> > fine on the PCIE slots.  I have a similar patch for rs400 chips on bug
> > 15626:
> > https://bugzilla.kernel.org/show_bug.cgi?id=15626
> 
> Hmm. Does 'pci_msi_enable' only cover regular PCI devices? Or will that 
> pci_no_msi() quirk disable MSI for PCIE too?

A quirk that used pci_no_msi() would disable all MSI for all devices.
However, these patches (and that in bug 15626) use PCI_BUS_FLAGS_NO_MSI
so that only the internal GPU devices are affected.

That "completely" in my patch title should better read "unconditionally".
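
[ Editorial sketch, not part of the original mail: the difference between
  a global switch such as pci_no_msi() and a per-bus flag such as
  PCI_BUS_FLAGS_NO_MSI can be modelled in userspace C as below. All
  names (model_bus, model_dev, model_no_msi, model_msi_allowed,
  BUS_FLAG_NO_MSI) are hypothetical, and the walk up the parent buses is
  a simplified assumption about how the PCI core consults the flag, not
  a copy of the kernel code. ]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BUS_FLAG_NO_MSI (1u << 0)	/* stands in for PCI_BUS_FLAGS_NO_MSI */

/* Hypothetical, heavily reduced bus and device structures. */
struct model_bus {
	struct model_bus *parent;	/* NULL for the root bus */
	unsigned int flags;
};

struct model_dev {
	struct model_bus *bus;		/* bus the device sits on */
};

/* Global switch: what a pci_no_msi()-style quirk would turn off. */
static bool msi_globally_enabled = true;

void model_no_msi(void)
{
	msi_globally_enabled = false;
}

/* MSI is allowed only if the global switch is on and no bus between the
 * device and the root carries the NO_MSI flag. */
bool model_msi_allowed(const struct model_dev *dev)
{
	const struct model_bus *bus;

	if (!msi_globally_enabled)
		return false;
	for (bus = dev->bus; bus != NULL; bus = bus->parent)
		if (bus->flags & BUS_FLAG_NO_MSI)
			return false;
	return true;
}
```

In this model, setting the flag on the bus behind the internal graphics
bridge blocks MSI for the GPU and HDMI audio functions on that bus, while
a device on a sibling bus (for example a PCIE slot) is unaffected. Calling
model_no_msi() instead would block every device.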


Regards,
Clemens

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780  (was: Re: Linux 2.6.34-rc3)
  2010-04-01 17:24         ` Linus Torvalds
@ 2010-04-01 17:53             ` Alex Deucher
  2010-04-01 17:53             ` Alex Deucher
  1 sibling, 0 replies; 242+ messages in thread
From: Alex Deucher @ 2010-04-01 17:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rafael J. Wysocki, Linux PCI, Greg KH, Clemens Ladisch,
	Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable

On Thu, Apr 1, 2010 at 1:24 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>
> On Thu, 1 Apr 2010, Alex Deucher wrote:
>>
>> Clemens' "PCI quirk: RS780/RS880: disable MSI completely" patch is the
>> right approach I think.  Note that it's only devices hung off the int
>> gfx pci to pci bridge that have broken MSI (gfx and audio).  MSI works
>> fine on the PCIE slots.  I have a similar patch for rs400 chips on bug
>> 15626:
>> https://bugzilla.kernel.org/show_bug.cgi?id=15626
>
> Hmm. Does 'pci_msi_enable' only cover regular PCI devices? Or will that
> pci_no_msi() quirk disable MSI for PCIE too? I think it will trigger for
> PCIE drivers too.
>
> Put another way: it sounds like the quirk now disables MSI for all
> devices. Maybe there would be some more targeted mode?
>

What I meant to say was MSI works fine on bridges other than the
bridge the internal gfx lives on.  quirk_disable_msi() just disables
MSI on the devices on that particular bridge as far as I understand
it, but I'm by no means an expert on the PCI code.  E.g., on my RS780
board, MSIs are only problematic on the integrated gfx chip.  MSIs
work fine on PCI/PCIE add-on cards and the integrated Ethernet.

Alex

>                Linus
>

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3)
  2010-04-01 16:29     ` Linus Torvalds
  2010-04-01 17:07         ` Alex Deucher
@ 2010-04-01 19:46       ` Rafael J. Wysocki
  2010-04-01 22:48       ` Jesse Barnes
  2 siblings, 0 replies; 242+ messages in thread
From: Rafael J. Wysocki @ 2010-04-01 19:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Dave Airlie, dri-devel, Jesse Barnes,
	Linux PCI, Clemens Ladisch, Alex Deucher, stable, Greg KH

On Thursday 01 April 2010, Linus Torvalds wrote:
> 
> On Thu, 1 Apr 2010, Rafael J. Wysocki wrote:
> > 
> > OK, I've verified that partial revert (below) is sufficient.
> 
> Hmm. Through the DRM merge I just did, this area actually conflicted, and 
> the resolved version is now
> 
>         if ((rdev->family >= CHIP_RV380) &&
>             (!(rdev->flags & RADEON_IS_IGP))) {
> 
> which presumably also fixes your issue?

Yes, it does.

Rafael
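
[ Editorial sketch, not from the original mail: the resolved radeon
  condition quoted above, pulled out into a standalone predicate. The enum
  is a hypothetical subset of the chip families and the flag value is
  invented; only the relative ordering of the families and the single-bit
  IGP flag matter. The names radeon_device_model and want_msi are
  illustrative, not the driver's. ]

```c
#include <stdbool.h>

/* Hypothetical subset of chip families, listed oldest first so that the
 * ">=" comparison means "this family or newer". */
enum chip_family { CHIP_R300, CHIP_RV380, CHIP_RS780, CHIP_RV770 };

/* Invented bit value; stands for "integrated graphics part". */
#define RADEON_IS_IGP 0x1u

struct radeon_device_model {
	enum chip_family family;
	unsigned int flags;
};

/* Use MSI only on RV380-or-newer chips that are not integrated parts. */
bool want_msi(const struct radeon_device_model *rdev)
{
	return (rdev->family >= CHIP_RV380) &&
	       !(rdev->flags & RADEON_IS_IGP);
}
```

[ Under these assumptions an RS780 with the IGP flag set is rejected even
  though its family is newer than RV380, which would match Rafael's report
  that the resolved version fixes his box. ]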

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3)
  2010-04-01 17:53             ` Alex Deucher
  (?)
@ 2010-04-01 20:17             ` Linus Torvalds
  2010-04-01 20:23                 ` Alex Deucher
  -1 siblings, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-01 20:17 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Rafael J. Wysocki, Linux PCI, Greg KH, Clemens Ladisch,
	Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable



On Thu, 1 Apr 2010, Alex Deucher wrote:
> 
> What I meant to say was MSI works fine on bridges other than the
> bridge the internal gfx lives on.  quirk_disable_msi() just disables
> MSI on the devices on that particular bridge as far as I understand
> it, but I'm by no means an expert on the PCI code.

Yes, it disabled MSI only on devices under that bridge. But if it's the 
northbridge, that would be everything, no?

But I don't know what devices those

	PCI_VENDOR_ID_AMD, 0x9602,
	PCI_VENDOR_ID_ASUSTEK, 0x9602,

things are. If they are just a PCIE->PCI bridge rather than the root 
bridge, then everything looks fine to me.

			Linus

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780  (was: Re: Linux 2.6.34-rc3)
  2010-04-01 20:17             ` Linus Torvalds
@ 2010-04-01 20:23                 ` Alex Deucher
  0 siblings, 0 replies; 242+ messages in thread
From: Alex Deucher @ 2010-04-01 20:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rafael J. Wysocki, Linux PCI, Greg KH, Clemens Ladisch,
	Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable

On Thu, Apr 1, 2010 at 4:17 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>
> On Thu, 1 Apr 2010, Alex Deucher wrote:
>>
>> What I meant to say was MSI works fine on bridges other than the
>> bridge the internal gfx lives on.  quirk_disable_msi() just disables
>> MSI on the devices on that particular bridge as far as I understand
>> it, but I'm by no means an expert on the PCI code.
>
> Yes, it disabled MSI only on devices under that bridge. But if it's the
> northbridge, that would be everything, no?
>
> But I don't know what devices those
>
>        PCI_VENDOR_ID_AMD, 0x9602,
>        PCI_VENDOR_ID_ASUSTEK, 0x9602,
>
> things are. If they are just a PCIE->PCI bridge rather than the root
> bridge, then everything looks fine to me.
>

Yup, those are just the PCI-to-PCI bridges used for the internal gfx.
Really there's only one, 0x9602, but some ASUS OEM boards have the
vendor ID wrong.

>                        Linus
>
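
[ Editorial sketch, not from the original mail: why the same device ID is
  listed under two vendors. A quirk table is matched on the exact
  (vendor, device) pair, so a board that reports the 0x9602 bridge with a
  different vendor ID needs its own entry. The numeric vendor values are
  assumptions (0x1022 for AMD, 0x1043 for ASUSTeK), and quirk_id,
  msi_quirk_ids and bridge_has_msi_quirk are hypothetical names. ]

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VENDOR_AMD     0x1022  /* assumed value of PCI_VENDOR_ID_AMD */
#define VENDOR_ASUSTEK 0x1043  /* assumed value of PCI_VENDOR_ID_ASUSTEK */

struct quirk_id {
	uint16_t vendor;
	uint16_t device;
};

/* One entry per (vendor, device) pair the quirk should fire for. */
static const struct quirk_id msi_quirk_ids[] = {
	{ VENDOR_AMD,     0x9602 },
	{ VENDOR_ASUSTEK, 0x9602 },
};

/* True if this exact vendor/device pair is in the table. */
bool bridge_has_msi_quirk(uint16_t vendor, uint16_t device)
{
	size_t i;

	for (i = 0; i < sizeof(msi_quirk_ids) / sizeof(msi_quirk_ids[0]); i++)
		if (msi_quirk_ids[i].vendor == vendor &&
		    msi_quirk_ids[i].device == device)
			return true;
	return false;
}
```

[ Any other bridge, including the host bridge, does not match, so the
  quirk stays confined to the internal gfx bridge. ]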

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3)
  2010-04-01 15:01           ` Alex Deucher
@ 2010-04-01 20:28             ` Rafael J. Wysocki
  -1 siblings, 0 replies; 242+ messages in thread
From: Rafael J. Wysocki @ 2010-04-01 20:28 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Clemens Ladisch, Linus Torvalds, Linux PCI, Greg KH,
	Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable,
	Dave Airlie

On Thursday 01 April 2010, Alex Deucher wrote:
> On Thu, Apr 1, 2010 at 2:36 AM, Clemens Ladisch <clemens@ladisch.de> wrote:
> > Alex Deucher wrote:
> >> On Wed, Mar 31, 2010 at 9:13 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> >>> On Tuesday 30 March 2010, Rafael J. Wysocki wrote:
> >>>> >       PCI quirk: RS780/RS880: work around missing MSI initialization
> >>>>
> >>>> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box
> >>>> which happens to have a RS780.
> >
> > So it's better to disable MSI unconditionally.
> >
> > Rafael, can you check if MSI works for the HDMI audio device?
> > (I'd guess it doesn't.)
> >
> >> I also have the attached patch queued in via Dave's tree to disable
> >> MSI on all IGP chips for the time being.
> >
> > This disables MSI only for the graphics device.  I'd prefer to have
> > the quirk on its bridge so that MSI gets disabled for the HDMI audio
> > device too, to avoid having to duplicate this quirk in the snd-hda-intel
> > driver.
> >
> > ==========
> >
> > PCI quirk: RS780/RS880: disable MSI completely
> >
> > The missing initialization of the nb_cntl.strap_msi_enable does not seem
> > to be the only problem that prevents MSI, so that quirk is not
> > sufficient to enable MSI on all machines.  To be safe, unconditionally
> > disable MSI for the internal graphics and HDMI audio on these chipsets.
> >
> > Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
> 
> Works fine here.
> 
> Tested-by: Alex Deucher <alexdeucher@gmail.com>

Unfortunately it doesn't work for me without the

if ((rdev->family >= CHIP_RV380) &&
            (!(rdev->flags & RADEON_IS_IGP)))

radeon quirk.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780  (was: Re: Linux 2.6.34-rc3)
  2010-04-01 20:28             ` Rafael J. Wysocki
@ 2010-04-01 20:39               ` Alex Deucher
  -1 siblings, 0 replies; 242+ messages in thread
From: Alex Deucher @ 2010-04-01 20:39 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Clemens Ladisch, Linus Torvalds, Linux PCI, Greg KH,
	Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable,
	Dave Airlie

On Thu, Apr 1, 2010 at 4:28 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> On Thursday 01 April 2010, Alex Deucher wrote:
>> On Thu, Apr 1, 2010 at 2:36 AM, Clemens Ladisch <clemens@ladisch.de> wrote:
>> > Alex Deucher wrote:
>> >> On Wed, Mar 31, 2010 at 9:13 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
>> >>> On Tuesday 30 March 2010, Rafael J. Wysocki wrote:
>> >>>> >       PCI quirk: RS780/RS880: work around missing MSI initialization
>> >>>>
>> >>>> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box
>> >>>> which happens to have a RS780.
>> >
>> > So it's better to disable MSI unconditionally.
>> >
>> > Rafael, can you check if MSI works for the HDMI audio device?
>> > (I'd guess it doesn't.)
>> >
>> >> I also have the attached patch queued in via Dave's tree to disable
>> >> MSI on all IGP chips for the time being.
>> >
>> > This disables MSI only for the graphics device.  I'd prefer to have
>> > the quirk on its bridge so that MSI gets disabled for the HDMI audio
>> > device too, to avoid having to duplicate this quirk in the snd-hda-intel
>> > driver.
>> >
>> > ==========
>> >
>> > PCI quirk: RS780/RS880: disable MSI completely
>> >
>> > The missing initialization of the nb_cntl.strap_msi_enable does not seem
>> > to be the only problem that prevents MSI, so that quirk is not
>> > sufficient to enable MSI on all machines.  To be safe, unconditionally
>> > disable MSI for the internal graphics and HDMI audio on these chipsets.
>> >
>> > Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
>>
>> Works fine here.
>>
>> Tested-by: Alex Deucher <alexdeucher@gmail.com>
>
> Unfortunately it doesn't work for me without the
>
> if ((rdev->family >= CHIP_RV380) &&
>            (!(rdev->flags & RADEON_IS_IGP)))
>
> radeon quirk.

What are your PCI IDs?

Alex

>
> Thanks,
> Rafael
>

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3)
  2010-04-01 20:39               ` Alex Deucher
  (?)
@ 2010-04-01 20:48               ` Rafael J. Wysocki
  2010-04-01 21:00                   ` Alex Deucher
  2010-04-01 21:01                   ` Alex Deucher
  -1 siblings, 2 replies; 242+ messages in thread
From: Rafael J. Wysocki @ 2010-04-01 20:48 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Clemens Ladisch, Linus Torvalds, Linux PCI, Greg KH,
	Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable,
	Dave Airlie

On Thursday 01 April 2010, Alex Deucher wrote:
> On Thu, Apr 1, 2010 at 4:28 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > On Thursday 01 April 2010, Alex Deucher wrote:
> >> On Thu, Apr 1, 2010 at 2:36 AM, Clemens Ladisch <clemens@ladisch.de> wrote:
> >> > Alex Deucher wrote:
> >> >> On Wed, Mar 31, 2010 at 9:13 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> >> >>> On Tuesday 30 March 2010, Rafael J. Wysocki wrote:
> >> >>>> >       PCI quirk: RS780/RS880: work around missing MSI initialization
> >> >>>>
> >> >>>> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box
> >> >>>> which happens to have a RS780.
> >> >
> >> > So it's better to disable MSI unconditionally.
> >> >
> >> > Rafael, can you check if MSI works for the HDMI audio device?
> >> > (I'd guess it doesn't.)
> >> >
> >> >> I also have the attached patch queued in via Dave's tree to disable
> >> >> MSI on all IGP chips for the time being.
> >> >
> >> > This disables MSI only for the graphics device.  I'd prefer to have
> >> > the quirk on its bridge so that MSI gets disabled for the HDMI audio
> >> > device too, to avoid having to duplicate this quirk in the snd-hda-intel
> >> > driver.
> >> >
> >> > ==========
> >> >
> >> > PCI quirk: RS780/RS880: disable MSI completely
> >> >
> >> > The missing initialization of the nb_cntl.strap_msi_enable does not seem
> >> > to be the only problem that prevents MSI, so that quirk is not
> >> > sufficient to enable MSI on all machines.  To be safe, unconditionally
> >> > disable MSI for the internal graphics and HDMI audio on these chipsets.
> >> >
> >> > Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
> >>
> >> Works fine here.
> >>
> >> Tested-by: Alex Deucher <alexdeucher@gmail.com>
> >
> > Unfortunately it doesn't work for me without the
> >
> > if ((rdev->family >= CHIP_RV380) &&
> >            (!(rdev->flags & RADEON_IS_IGP)))
> >
> > radeon quirk.
> 
> what are your pci ids?

1022:960b

I guess 1022 is AMD.

OK, I'll try to add that.

Rafael

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780  (was: Re: Linux 2.6.34-rc3)
  2010-04-01 20:48               ` Rafael J. Wysocki
@ 2010-04-01 21:00                   ` Alex Deucher
  2010-04-01 21:01                   ` Alex Deucher
  1 sibling, 0 replies; 242+ messages in thread
From: Alex Deucher @ 2010-04-01 21:00 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Clemens Ladisch, Linus Torvalds, Linux PCI, Greg KH,
	Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable,
	Dave Airlie

On Thu, Apr 1, 2010 at 4:48 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> On Thursday 01 April 2010, Alex Deucher wrote:
>> On Thu, Apr 1, 2010 at 4:28 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
>> > On Thursday 01 April 2010, Alex Deucher wrote:
>> >> On Thu, Apr 1, 2010 at 2:36 AM, Clemens Ladisch <clemens@ladisch.de> wrote:
>> >> > Alex Deucher wrote:
>> >> >> On Wed, Mar 31, 2010 at 9:13 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
>> >> >>> On Tuesday 30 March 2010, Rafael J. Wysocki wrote:
>> >> >>>> >       PCI quirk: RS780/RS880: work around missing MSI initialization
>> >> >>>>
>> >> >>>> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box
>> >> >>>> which happens to have a RS780.
>> >> >
>> >> > So it's better to disable MSI unconditionally.
>> >> >
>> >> > Rafael, can you check if MSI works for the HDMI audio device?
>> >> > (I'd guess it doesn't.)
>> >> >
>> >> >> I also have the attached patch queued in via Dave's tree to disable
>> >> >> MSI on all IGP chips for the time being.
>> >> >
>> >> > This disables MSI only for the graphics device.  I'd prefer to have
>> >> > the quirk on its bridge so that MSI gets disabled for the HDMI audio
>> >> > device too, to avoid having to duplicate this quirk in the snd-hda-intel
>> >> > driver.
>> >> >
>> >> > ==========
>> >> >
>> >> > PCI quirk: RS780/RS880: disable MSI completely
>> >> >
>> >> > The missing initialization of the nb_cntl.strap_msi_enable does not seem
>> >> > to be the only problem that prevents MSI, so that quirk is not
>> >> > sufficient to enable MSI on all machines.  To be safe, unconditionally
>> >> > disable MSI for the internal graphics and HDMI audio on these chipsets.
>> >> >
>> >> > Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
>> >>
>> >> Works fine here.
>> >>
>> >> Tested-by: Alex Deucher <alexdeucher@gmail.com>
>> >
>> > Unfortunately it doesn't work for me without the
>> >
>> > if ((rdev->family >= CHIP_RV380) &&
>> >            (!(rdev->flags & RADEON_IS_IGP)))
>> >
>> > radeon quirk.
>>
>> what are your pci ids?
>
> 1022:960b
>
> I guess 1022 is AMD.
>
> OK, I'll try to add that.

0x960b won't affect the internal gfx.  That bridge is for the PCIE x16 gfx slot.

0x9600                       Host bridge
0x9602                       Internal GFX PCI-PCI bridge ID
0x9603                       External GFX - port 0
0x960B                       External GFX - port 1
0x9604                       PCI-PCI bridge - Port 0
0x9605                       PCI-PCI bridge - Port 1
0x9606                       PCI-PCI bridge - Port 2
0x9607                       PCI-PCI bridge - Port 3
0x9608                       PCI-PCI bridge - Port 4
0x9609                       PCI-PCI bridge - Port 5
0x960A                       PCI-PCI bridge (SB)
0x960F                       HD Audio controller
0x791A                       HDMI Audio codec

Alex

>
> Rafael
>
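
[ Editorial sketch, not from the original mail: the ID list above as a
  lookup table, with a helper that takes a "vendor:device" string such as
  the "1022:960b" Rafael reported. The descriptions are copied from Alex's
  list; rs780_dev, rs780_devs and describe_pci_id are hypothetical names. ]

```c
#include <stddef.h>
#include <stdio.h>

struct rs780_dev {
	unsigned int id;
	const char *desc;
};

/* Device IDs and descriptions as given in the list above. */
static const struct rs780_dev rs780_devs[] = {
	{ 0x9600, "Host bridge" },
	{ 0x9602, "Internal GFX PCI-PCI bridge" },
	{ 0x9603, "External GFX - port 0" },
	{ 0x960B, "External GFX - port 1" },
	{ 0x9604, "PCI-PCI bridge - Port 0" },
	{ 0x9605, "PCI-PCI bridge - Port 1" },
	{ 0x9606, "PCI-PCI bridge - Port 2" },
	{ 0x9607, "PCI-PCI bridge - Port 3" },
	{ 0x9608, "PCI-PCI bridge - Port 4" },
	{ 0x9609, "PCI-PCI bridge - Port 5" },
	{ 0x960A, "PCI-PCI bridge (SB)" },
	{ 0x960F, "HD Audio controller" },
	{ 0x791A, "HDMI Audio codec" },
};

/* Parse "vvvv:dddd" (hex) and return the description for the device ID,
 * or NULL if the string is malformed or the ID is not in the table. */
const char *describe_pci_id(const char *pair)
{
	unsigned int vendor, device;
	size_t i;

	if (sscanf(pair, "%4x:%4x", &vendor, &device) != 2)
		return NULL;
	/* The vendor is deliberately ignored: some boards report the wrong one. */
	(void)vendor;

	for (i = 0; i < sizeof(rs780_devs) / sizeof(rs780_devs[0]); i++)
		if (rs780_devs[i].id == device)
			return rs780_devs[i].desc;
	return NULL;
}
```

[ Looking up "1022:960b" this way lands on the external gfx port, not on
  the 0x9602 internal gfx bridge that the quirk targets. ]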

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780  (was: Re: Linux 2.6.34-rc3)
  2010-04-01 20:48               ` Rafael J. Wysocki
@ 2010-04-01 21:01                   ` Alex Deucher
  2010-04-01 21:01                   ` Alex Deucher
  1 sibling, 0 replies; 242+ messages in thread
From: Alex Deucher @ 2010-04-01 21:01 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Clemens Ladisch, Linus Torvalds, Linux PCI, Greg KH,
	Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable,
	Dave Airlie

On Thu, Apr 1, 2010 at 4:48 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> On Thursday 01 April 2010, Alex Deucher wrote:
>> On Thu, Apr 1, 2010 at 4:28 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
>> > On Thursday 01 April 2010, Alex Deucher wrote:
>> >> On Thu, Apr 1, 2010 at 2:36 AM, Clemens Ladisch <clemens@ladisch.de> wrote:
>> >> > Alex Deucher wrote:
>> >> >> On Wed, Mar 31, 2010 at 9:13 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
>> >> >>> On Tuesday 30 March 2010, Rafael J. Wysocki wrote:
>> >> >>>> >       PCI quirk: RS780/RS880: work around missing MSI initialization
>> >> >>>>
>> >> >>>> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box
>> >> >>>> which happens to have a RS780.
>> >> >
>> >> > So it's better to disable MSI unconditionally.
>> >> >
>> >> > Rafael, can you check if MSI works for the HDMI audio device?
>> >> > (I'd guess it doesn't.)
>> >> >
>> >> >> I also have the attached patch queued in via Dave's tree to disable
>> >> >> MSI on all IGP chips for the time being.
>> >> >
>> >> > This disables MSI only for the graphics device.  I'd prefer to have
>> >> > the quirk on its bridge so that MSI gets disabled for the HDMI audio
>> >> > device too, to avoid having to duplicate this quirk in the snd-hda-intel
>> >> > driver.
>> >> >
>> >> > ==========
>> >> >
>> >> > PCI quirk: RS780/RS880: disable MSI completely
>> >> >
>> >> > The missing initialization of the nb_cntl.strap_msi_enable does not seem
>> >> > to be the only problem that prevents MSI, so that quirk is not
>> >> > sufficient to enable MSI on all machines.  To be safe, unconditionally
>> >> > disable MSI for the internal graphics and HDMI audio on these chipsets.
>> >> >
>> >> > Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
>> >>
>> >> Works fine here.
>> >>
>> >> Tested-by: Alex Deucher <alexdeucher@gmail.com>
>> >
>> > Unfortunately it doesn't work for me without the
>> >
>> > if ((rdev->family >= CHIP_RV380) &&
>> >            (!(rdev->flags & RADEON_IS_IGP)))
>> >
>> > radeon quirk.
>>
>> what are your pci ids?
>
> 1022:960b
>
> I guess 1022 is AMD.
>
> OK, I'll try to add that.

It's possible your OEM has the wrong vendor ID for the 0x9602 bridge.

Alex

>
> Rafael
>

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3)
  2010-04-01 21:01                   ` Alex Deucher
  (?)
@ 2010-04-01 21:08                   ` Rafael J. Wysocki
  2010-04-01 21:13                       ` Alex Deucher
  -1 siblings, 1 reply; 242+ messages in thread
From: Rafael J. Wysocki @ 2010-04-01 21:08 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Clemens Ladisch, Linus Torvalds, Linux PCI, Greg KH,
	Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable,
	Dave Airlie

On Thursday 01 April 2010, Alex Deucher wrote:
> On Thu, Apr 1, 2010 at 4:48 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > On Thursday 01 April 2010, Alex Deucher wrote:
> >> On Thu, Apr 1, 2010 at 4:28 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> >> > On Thursday 01 April 2010, Alex Deucher wrote:
> >> >> On Thu, Apr 1, 2010 at 2:36 AM, Clemens Ladisch <clemens@ladisch.de> wrote:
> >> >> > Alex Deucher wrote:
> >> >> >> On Wed, Mar 31, 2010 at 9:13 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> >> >> >>> On Tuesday 30 March 2010, Rafael J. Wysocki wrote:
> >> >> >>>> >       PCI quirk: RS780/RS880: work around missing MSI initialization
> >> >> >>>>
> >> >> >>>> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box
> >> >> >>>> which happens to have a RS780.
> >> >> >
> >> >> > So it's better to disable MSI unconditionally.
> >> >> >
> >> >> > Rafael, can you check if MSI works for the HDMI audio device?
> >> >> > (I'd guess it doesn't.)
> >> >> >
> >> >> >> I also have the attached patch queued in via Dave's tree to disable
> >> >> >> MSI on all IGP chips for the time being.
> >> >> >
> >> >> > This disables MSI only for the graphics device.  I'd prefer to have
> >> >> > the quirk on its bridge so that MSI gets disabled for the HDMI audio
> >> >> > device too, to avoid having to duplicate this quirk in the snd-hda-intel
> >> >> > driver.
> >> >> >
> >> >> > ==========
> >> >> >
> >> >> > PCI quirk: RS780/RS880: disable MSI completely
> >> >> >
> >> >> > The missing initialization of the nb_cntl.strap_msi_enable does not seem
> >> >> > to be the only problem that prevents MSI, so that quirk is not
> >> >> > sufficient to enable MSI on all machines.  To be safe, unconditionally
> >> >> > disable MSI for the internal graphics and HDMI audio on these chipsets.
> >> >> >
> >> >> > Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
> >> >>
> >> >> Works fine here.
> >> >>
> >> >> Tested-by: Alex Deucher <alexdeucher@gmail.com>
> >> >
> >> > Unfortunately it doesn't work for me without the
> >> >
> >> > if ((rdev->family >= CHIP_RV380) &&
> >> >            (!(rdev->flags & RADEON_IS_IGP)))
> >> >
> >> > radeon quirk.
> >>
> >> what are your pci ids?
> >
> > 1022:960b
> >
> > I guess 1022 is AMD.
> >
> > OK, I'll try to add that.
> 
> It's possible your oem has the wrong vendor id for the 0x9602 bridge.

Yes, the patch below works.

Thanks,
Rafael


---
 drivers/gpu/drm/radeon/radeon_irq_kms.c |    3 --
 drivers/pci/quirks.c                    |   36 ++------------------------------
 2 files changed, 4 insertions(+), 35 deletions(-)

Index: linux-2.6/drivers/pci/quirks.c
===================================================================
--- linux-2.6.orig/drivers/pci/quirks.c
+++ linux-2.6/drivers/pci/quirks.c
@@ -2123,6 +2123,9 @@ static void __devinit quirk_disable_msi(
 	}
 }
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_8131_BRIDGE, quirk_disable_msi);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9602, quirk_disable_msi);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, 0x9602, quirk_disable_msi);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AI, 0x9602, quirk_disable_msi);
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_VIA, 0xa238, quirk_disable_msi);
 
 /* Go through the list of Hypertransport capabilities and
@@ -2495,39 +2498,6 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AT
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x4375,
 			quirk_msi_intx_disable_bug);
 
-/*
- * MSI does not work with the AMD RS780/RS880 internal graphics and HDMI audio
- * devices unless the BIOS has initialized the nb_cntl.strap_msi_enable bit.
- */
-static void __init rs780_int_gfx_disable_msi(struct pci_dev *int_gfx_bridge)
-{
-	u32 nb_cntl;
-
-	if (!int_gfx_bridge->subordinate)
-		return;
-
-	pci_bus_write_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
-				   0x60, 0);
-	pci_bus_read_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
-				  0x64, &nb_cntl);
-
-	if (!(nb_cntl & BIT(10))) {
-		dev_warn(&int_gfx_bridge->dev,
-			 FW_WARN "RS780: MSI for internal graphics disabled\n");
-		int_gfx_bridge->subordinate->bus_flags |= PCI_BUS_FLAGS_NO_MSI;
-	}
-}
-
-#define PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX	0x9602
-
-DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD,
-			PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX,
-			rs780_int_gfx_disable_msi);
-/* wrong vendor ID on M4A785TD motherboard: */
-DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK,
-			PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX,
-			rs780_int_gfx_disable_msi);
-
 #endif /* CONFIG_PCI_MSI */
 
 #ifdef CONFIG_PCI_IOV
Index: linux-2.6/drivers/gpu/drm/radeon/radeon_irq_kms.c
===================================================================
--- linux-2.6.orig/drivers/gpu/drm/radeon/radeon_irq_kms.c
+++ linux-2.6/drivers/gpu/drm/radeon/radeon_irq_kms.c
@@ -117,8 +117,7 @@ int radeon_irq_kms_init(struct radeon_de
 	/* MSIs don't seem to work reliably on all IGP
 	 * chips.  Disable MSI on them for now.
 	 */
-	if ((rdev->family >= CHIP_RV380) &&
-	    (!(rdev->flags & RADEON_IS_IGP))) {
+	if (rdev->family >= CHIP_RV380) {
 		int ret = pci_enable_msi(rdev->pdev);
 		if (!ret) {
 			rdev->msi_enabled = 1;
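
[ Illustration only, not part of the patch above. A minimal user-space
  sketch of why putting quirk_disable_msi on the 0x9602 bridge covers both
  the internal graphics and the HDMI audio device: the quirk marks the
  bridge's subordinate bus, and the MSI core is assumed to refuse MSI for
  any device that has a flagged bus between it and the root. All model_*
  names are hypothetical stand-ins for struct pci_bus, struct pci_dev and
  PCI_BUS_FLAGS_NO_MSI; this is not kernel code. ]

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for PCI_BUS_FLAGS_NO_MSI. */
#define MODEL_BUS_FLAGS_NO_MSI 0x1u

/* Simplified model of a PCI bus: flags plus a link to the parent bus. */
struct model_bus {
	unsigned int bus_flags;
	struct model_bus *parent;	/* NULL for the root bus */
};

/* Simplified model of a PCI device; bridges also own a subordinate bus. */
struct model_dev {
	struct model_bus *bus;		/* bus the device sits on */
	struct model_bus *subordinate;	/* secondary bus, bridges only */
};

/* Bridge quirk: forbid MSI for everything behind this bridge. */
static void model_quirk_disable_msi(struct model_dev *bridge)
{
	if (bridge->subordinate)
		bridge->subordinate->bus_flags |= MODEL_BUS_FLAGS_NO_MSI;
}

/* MSI is allowed only if no bus from the device up to the root is flagged. */
static bool model_msi_allowed(const struct model_dev *dev)
{
	const struct model_bus *bus;

	for (bus = dev->bus; bus; bus = bus->parent)
		if (bus->bus_flags & MODEL_BUS_FLAGS_NO_MSI)
			return false;
	return true;
}
```

[ In this model, one call to model_quirk_disable_msi() on the bridge makes
  model_msi_allowed() return false for every device on the bus behind it,
  while devices elsewhere are unaffected. ]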

* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780  (was: Re: Linux 2.6.34-rc3)
  2010-04-01 21:08                   ` Rafael J. Wysocki
@ 2010-04-01 21:13                       ` Alex Deucher
  0 siblings, 0 replies; 242+ messages in thread
From: Alex Deucher @ 2010-04-01 21:13 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Clemens Ladisch, Linus Torvalds, Linux PCI, Greg KH,
	Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable,
	Dave Airlie

On Thu, Apr 1, 2010 at 5:08 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> On Thursday 01 April 2010, Alex Deucher wrote:
>> On Thu, Apr 1, 2010 at 4:48 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
>> > On Thursday 01 April 2010, Alex Deucher wrote:
>> >> On Thu, Apr 1, 2010 at 4:28 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
>> >> > On Thursday 01 April 2010, Alex Deucher wrote:
>> >> >> On Thu, Apr 1, 2010 at 2:36 AM, Clemens Ladisch <clemens@ladisch.de> wrote:
>> >> >> > Alex Deucher wrote:
>> >> >> >> On Wed, Mar 31, 2010 at 9:13 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
>> >> >> >>> On Tuesday 30 March 2010, Rafael J. Wysocki wrote:
>> >> >> >>>> >       PCI quirk: RS780/RS880: work around missing MSI initialization
>> >> >> >>>>
>> >> >> >>>> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box
>> >> >> >>>> which happens to have a RS780.
>> >> >> >
>> >> >> > So it's better to disable MSI unconditionally.
>> >> >> >
>> >> >> > Rafael, can you check if MSI works for the HDMI audio device?
>> >> >> > (I'd guess it doesn't.)
>> >> >> >
>> >> >> >> I also have the attached patch queued in via Dave's tree to disable
>> >> >> >> MSI on all IGP chips for the time being.
>> >> >> >
>> >> >> > This disables MSI only for the graphics device.  I'd prefer to have
>> >> >> > the quirk on its bridge so that MSI gets disabled for the HDMI audio
>> >> >> > device too, to avoid having to duplicate this quirk in the snd-hda-intel
>> >> >> > driver.
>> >> >> >
>> >> >> > ==========
>> >> >> >
>> >> >> > PCI quirk: RS780/RS880: disable MSI completely
>> >> >> >
>> >> >> > The missing initialization of the nb_cntl.strap_msi_enable does not seem
>> >> >> > to be the only problem that prevents MSI, so that quirk is not
>> >> >> > sufficient to enable MSI on all machines.  To be safe, unconditionally
>> >> >> > disable MSI for the internal graphics and HDMI audio on these chipsets.
>> >> >> >
>> >> >> > Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
>> >> >>
>> >> >> Works fine here.
>> >> >>
>> >> >> Tested-by: Alex Deucher <alexdeucher@gmail.com>
>> >> >
>> >> > Unfortunately it doesn't work for me without the
>> >> >
>> >> > if ((rdev->family >= CHIP_RV380) &&
>> >> >            (!(rdev->flags & RADEON_IS_IGP)))
>> >> >
>> >> > radeon quirk.
>> >>
>> >> what are your pci ids?
>> >
>> > 1022:960b
>> >
>> > I guess 1022 is AMD.
>> >
>> > OK, I'll try to add that.
>>
>> It's possible your oem has the wrong vendor id for the 0x9602 bridge.
>
> Yes, the patch below works.
>
> Thanks,
> Rafael
>
>
> ---
>  drivers/gpu/drm/radeon/radeon_irq_kms.c |    3 --
>  drivers/pci/quirks.c                    |   36 ++------------------------------
>  2 files changed, 4 insertions(+), 35 deletions(-)
>
> Index: linux-2.6/drivers/pci/quirks.c
> ===================================================================
> --- linux-2.6.orig/drivers/pci/quirks.c
> +++ linux-2.6/drivers/pci/quirks.c
> @@ -2123,6 +2123,9 @@ static void __devinit quirk_disable_msi(
>        }
>  }
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_8131_BRIDGE, quirk_disable_msi);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9602, quirk_disable_msi);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, 0x9602, quirk_disable_msi);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AI, 0x9602, quirk_disable_msi);
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_VIA, 0xa238, quirk_disable_msi);
>
>  /* Go through the list of Hypertransport capabilities and
> @@ -2495,39 +2498,6 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AT
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x4375,
>                        quirk_msi_intx_disable_bug);
>
> -/*
> - * MSI does not work with the AMD RS780/RS880 internal graphics and HDMI audio
> - * devices unless the BIOS has initialized the nb_cntl.strap_msi_enable bit.
> - */
> -static void __init rs780_int_gfx_disable_msi(struct pci_dev *int_gfx_bridge)
> -{
> -       u32 nb_cntl;
> -
> -       if (!int_gfx_bridge->subordinate)
> -               return;
> -
> -       pci_bus_write_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
> -                                  0x60, 0);
> -       pci_bus_read_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
> -                                 0x64, &nb_cntl);
> -
> -       if (!(nb_cntl & BIT(10))) {
> -               dev_warn(&int_gfx_bridge->dev,
> -                        FW_WARN "RS780: MSI for internal graphics disabled\n");
> -               int_gfx_bridge->subordinate->bus_flags |= PCI_BUS_FLAGS_NO_MSI;
> -       }
> -}
> -
> -#define PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX    0x9602
> -
> -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD,
> -                       PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX,
> -                       rs780_int_gfx_disable_msi);
> -/* wrong vendor ID on M4A785TD motherboard: */
> -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK,
> -                       PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX,
> -                       rs780_int_gfx_disable_msi);
> -
>  #endif /* CONFIG_PCI_MSI */
>
>  #ifdef CONFIG_PCI_IOV
> Index: linux-2.6/drivers/gpu/drm/radeon/radeon_irq_kms.c
> ===================================================================
> --- linux-2.6.orig/drivers/gpu/drm/radeon/radeon_irq_kms.c
> +++ linux-2.6/drivers/gpu/drm/radeon/radeon_irq_kms.c
> @@ -117,8 +117,7 @@ int radeon_irq_kms_init(struct radeon_de
>        /* MSIs don't seem to work reliably on all IGP
>         * chips.  Disable MSI on them for now.
>         */
> -       if ((rdev->family >= CHIP_RV380) &&
> -           (!(rdev->flags & RADEON_IS_IGP))) {
> +       if (rdev->family >= CHIP_RV380) {
>                int ret = pci_enable_msi(rdev->pdev);
>                if (!ret) {
>                        rdev->msi_enabled = 1;
>

Let's skip the second hunk (the radeon_irq_kms.c change) for now. Other,
non-RS780 IGP chips could be problematic too, so I'd rather leave MSI
disabled for IGPs in the radeon driver.
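
[ Illustration only: a small user-space sketch of the difference between
  the check that stays in the driver and the one the second hunk proposed.
  The model_* names and enum values are hypothetical; the real
  enum radeon_family and RADEON_IS_IGP bit are not reproduced here. ]

```c
#include <stdbool.h>

/* Hypothetical, heavily shortened stand-in for enum radeon_family. */
enum model_family { MODEL_CHIP_R100, MODEL_CHIP_RV380, MODEL_CHIP_RS780 };

/* Hypothetical stand-in for the RADEON_IS_IGP flag bit. */
#define MODEL_IS_IGP 0x1u

struct model_radeon {
	enum model_family family;
	unsigned int flags;
};

/* Check kept in the driver: try MSI only on RV380+ parts that are not IGPs. */
static bool model_try_msi_kept(const struct model_radeon *rdev)
{
	return rdev->family >= MODEL_CHIP_RV380 &&
	       !(rdev->flags & MODEL_IS_IGP);
}

/* Check from the skipped hunk: the IGP test is dropped. */
static bool model_try_msi_proposed(const struct model_radeon *rdev)
{
	return rdev->family >= MODEL_CHIP_RV380;
}
```

[ With the proposed check, an IGP would attempt MSI and rely on a bridge
  quirk to refuse it. That only helps on chipsets that have such a quirk,
  which is why the driver-side IGP test is being kept. ]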

Alex

* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3)
  2010-04-01 21:13                       ` Alex Deucher
  (?)
@ 2010-04-01 21:46                       ` Rafael J. Wysocki
  2010-04-01 22:07                           ` Alex Deucher
  -1 siblings, 1 reply; 242+ messages in thread
From: Rafael J. Wysocki @ 2010-04-01 21:46 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Clemens Ladisch, Linus Torvalds, Linux PCI, Greg KH,
	Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable,
	Dave Airlie

On Thursday 01 April 2010, Alex Deucher wrote:
> On Thu, Apr 1, 2010 at 5:08 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > On Thursday 01 April 2010, Alex Deucher wrote:
> >> On Thu, Apr 1, 2010 at 4:48 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> >> > On Thursday 01 April 2010, Alex Deucher wrote:
> >> >> On Thu, Apr 1, 2010 at 4:28 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> >> >> > On Thursday 01 April 2010, Alex Deucher wrote:
> >> >> >> On Thu, Apr 1, 2010 at 2:36 AM, Clemens Ladisch <clemens@ladisch.de> wrote:
> >> >> >> > Alex Deucher wrote:
> >> >> >> >> On Wed, Mar 31, 2010 at 9:13 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> >> >> >> >>> On Tuesday 30 March 2010, Rafael J. Wysocki wrote:
> >> >> >> >>>> >       PCI quirk: RS780/RS880: work around missing MSI initialization
> >> >> >> >>>>
> >> >> >> >>>> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box
> >> >> >> >>>> which happens to have a RS780.
> >> >> >> >
> >> >> >> > So it's better to disable MSI unconditionally.
> >> >> >> >
> >> >> >> > Rafael, can you check if MSI works for the HDMI audio device?
> >> >> >> > (I'd guess it doesn't.)
> >> >> >> >
> >> >> >> >> I also have the attached patch queued in via Dave's tree to disable
> >> >> >> >> MSI on all IGP chips for the time being.
> >> >> >> >
> >> >> >> > This disables MSI only for the graphics device.  I'd prefer to have
> >> >> >> > the quirk on its bridge so that MSI gets disabled for the HDMI audio
> >> >> >> > device too, to avoid having to duplicate this quirk in the snd-hda-intel
> >> >> >> > driver.
> >> >> >> >
> >> >> >> > ==========
> >> >> >> >
> >> >> >> > PCI quirk: RS780/RS880: disable MSI completely
> >> >> >> >
> >> >> >> > The missing initialization of the nb_cntl.strap_msi_enable does not seem
> >> >> >> > to be the only problem that prevents MSI, so that quirk is not
> >> >> >> > sufficient to enable MSI on all machines.  To be safe, unconditionally
> >> >> >> > disable MSI for the internal graphics and HDMI audio on these chipsets.
> >> >> >> >
> >> >> >> > Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
> >> >> >>
> >> >> >> Works fine here.
> >> >> >>
> >> >> >> Tested-by: Alex Deucher <alexdeucher@gmail.com>
> >> >> >
> >> >> > Unfortunately it doesn't work for me without the
> >> >> >
> >> >> > if ((rdev->family >= CHIP_RV380) &&
> >> >> >            (!(rdev->flags & RADEON_IS_IGP)))
> >> >> >
> >> >> > radeon quirk.
> >> >>
> >> >> what are your pci ids?
> >> >
> >> > 1022:960b
> >> >
> >> > I guess 1022 is AMD.
> >> >
> >> > OK, I'll try to add that.
> >>
> >> It's possible your oem has the wrong vendor id for the 0x9602 bridge.
> >
> > Yes, the patch below works.
> >
> > Thanks,
> > Rafael
> >
> >
> > ---
> >  drivers/gpu/drm/radeon/radeon_irq_kms.c |    3 --
> >  drivers/pci/quirks.c                    |   36 ++------------------------------
> >  2 files changed, 4 insertions(+), 35 deletions(-)
> >
> > Index: linux-2.6/drivers/pci/quirks.c
> > ===================================================================
> > --- linux-2.6.orig/drivers/pci/quirks.c
> > +++ linux-2.6/drivers/pci/quirks.c
> > @@ -2123,6 +2123,9 @@ static void __devinit quirk_disable_msi(
> >        }
> >  }
> >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_8131_BRIDGE, quirk_disable_msi);
> > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9602, quirk_disable_msi);
> > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, 0x9602, quirk_disable_msi);
> > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AI, 0x9602, quirk_disable_msi);
> >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_VIA, 0xa238, quirk_disable_msi);
> >
> >  /* Go through the list of Hypertransport capabilities and
> > @@ -2495,39 +2498,6 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AT
> >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x4375,
> >                        quirk_msi_intx_disable_bug);
> >
> > -/*
> > - * MSI does not work with the AMD RS780/RS880 internal graphics and HDMI audio
> > - * devices unless the BIOS has initialized the nb_cntl.strap_msi_enable bit.
> > - */
> > -static void __init rs780_int_gfx_disable_msi(struct pci_dev *int_gfx_bridge)
> > -{
> > -       u32 nb_cntl;
> > -
> > -       if (!int_gfx_bridge->subordinate)
> > -               return;
> > -
> > -       pci_bus_write_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
> > -                                  0x60, 0);
> > -       pci_bus_read_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
> > -                                 0x64, &nb_cntl);
> > -
> > -       if (!(nb_cntl & BIT(10))) {
> > -               dev_warn(&int_gfx_bridge->dev,
> > -                        FW_WARN "RS780: MSI for internal graphics disabled\n");
> > -               int_gfx_bridge->subordinate->bus_flags |= PCI_BUS_FLAGS_NO_MSI;
> > -       }
> > -}
> > -
> > -#define PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX    0x9602
> > -
> > -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD,
> > -                       PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX,
> > -                       rs780_int_gfx_disable_msi);
> > -/* wrong vendor ID on M4A785TD motherboard: */
> > -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK,
> > -                       PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX,
> > -                       rs780_int_gfx_disable_msi);
> > -
> >  #endif /* CONFIG_PCI_MSI */
> >
> >  #ifdef CONFIG_PCI_IOV
> > Index: linux-2.6/drivers/gpu/drm/radeon/radeon_irq_kms.c
> > ===================================================================
> > --- linux-2.6.orig/drivers/gpu/drm/radeon/radeon_irq_kms.c
> > +++ linux-2.6/drivers/gpu/drm/radeon/radeon_irq_kms.c
> > @@ -117,8 +117,7 @@ int radeon_irq_kms_init(struct radeon_de
> >        /* MSIs don't seem to work reliably on all IGP
> >         * chips.  Disable MSI on them for now.
> >         */
> > -       if ((rdev->family >= CHIP_RV380) &&
> > -           (!(rdev->flags & RADEON_IS_IGP))) {
> > +       if (rdev->family >= CHIP_RV380) {
> >                int ret = pci_enable_msi(rdev->pdev);
> >                if (!ret) {
> >                        rdev->msi_enabled = 1;
> >
> 
> Let's skip this second chunk for now as there are other non-RS780 IGP
> chips that could be problematic, so I'd rather just leave MSIs
> disabled for now.

Works for me.

So do you want me to resubmit?

Rafael

* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780  (was: Re: Linux 2.6.34-rc3)
  2010-04-01 21:46                       ` Rafael J. Wysocki
@ 2010-04-01 22:07                           ` Alex Deucher
  0 siblings, 0 replies; 242+ messages in thread
From: Alex Deucher @ 2010-04-01 22:07 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Clemens Ladisch, Linus Torvalds, Linux PCI, Greg KH,
	Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable,
	Dave Airlie

On Thu, Apr 1, 2010 at 5:46 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> On Thursday 01 April 2010, Alex Deucher wrote:
>> On Thu, Apr 1, 2010 at 5:08 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
>> > On Thursday 01 April 2010, Alex Deucher wrote:
>> >> On Thu, Apr 1, 2010 at 4:48 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
>> >> > On Thursday 01 April 2010, Alex Deucher wrote:
>> >> >> On Thu, Apr 1, 2010 at 4:28 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
>> >> >> > On Thursday 01 April 2010, Alex Deucher wrote:
>> >> >> >> On Thu, Apr 1, 2010 at 2:36 AM, Clemens Ladisch <clemens@ladisch.de> wrote:
>> >> >> >> > Alex Deucher wrote:
>> >> >> >> >> On Wed, Mar 31, 2010 at 9:13 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
>> >> >> >> >>> On Tuesday 30 March 2010, Rafael J. Wysocki wrote:
>> >> >> >> >>>> >       PCI quirk: RS780/RS880: work around missing MSI initialization
>> >> >> >> >>>>
>> >> >> >> >>>> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box
>> >> >> >> >>>> which happens to have a RS780.
>> >> >> >> >
>> >> >> >> > So it's better to disable MSI unconditionally.
>> >> >> >> >
>> >> >> >> > Rafael, can you check if MSI works for the HDMI audio device?
>> >> >> >> > (I'd guess it doesn't.)
>> >> >> >> >
>> >> >> >> >> I also have the attached patch queued in via Dave's tree to disable
>> >> >> >> >> MSI on all IGP chips for the time being.
>> >> >> >> >
>> >> >> >> > This disables MSI only for the graphics device.  I'd prefer to have
>> >> >> >> > the quirk on its bridge so that MSI gets disabled for the HDMI audio
>> >> >> >> > device too, to avoid having to duplicate this quirk in the snd-hda-intel
>> >> >> >> > driver.
>> >> >> >> >
>> >> >> >> > ==========
>> >> >> >> >
>> >> >> >> > PCI quirk: RS780/RS880: disable MSI completely
>> >> >> >> >
>> >> >> >> > The missing initialization of the nb_cntl.strap_msi_enable does not seem
>> >> >> >> > to be the only problem that prevents MSI, so that quirk is not
>> >> >> >> > sufficient to enable MSI on all machines.  To be safe, unconditionally
>> >> >> >> > disable MSI for the internal graphics and HDMI audio on these chipsets.
>> >> >> >> >
>> >> >> >> > Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
>> >> >> >>
>> >> >> >> Works fine here.
>> >> >> >>
>> >> >> >> Tested-by: Alex Deucher <alexdeucher@gmail.com>
>> >> >> >
>> >> >> > Unfortunately it doesn't work for me without the
>> >> >> >
>> >> >> > if ((rdev->family >= CHIP_RV380) &&
>> >> >> >            (!(rdev->flags & RADEON_IS_IGP)))
>> >> >> >
>> >> >> > radeon quirk.
>> >> >>
>> >> >> what are your pci ids?
>> >> >
>> >> > 1022:960b
>> >> >
>> >> > I guess 1022 is AMD.
>> >> >
>> >> > OK, I'll try to add that.
>> >>
>> >> It's possible your oem has the wrong vendor id for the 0x9602 bridge.
>> >
>> > Yes, the patch below works.
>> >
>> > Thanks,
>> > Rafael
>> >
>> >
>> > ---
>> >  drivers/gpu/drm/radeon/radeon_irq_kms.c |    3 --
>> >  drivers/pci/quirks.c                    |   36 ++------------------------------
>> >  2 files changed, 4 insertions(+), 35 deletions(-)
>> >
>> > Index: linux-2.6/drivers/pci/quirks.c
>> > ===================================================================
>> > --- linux-2.6.orig/drivers/pci/quirks.c
>> > +++ linux-2.6/drivers/pci/quirks.c
>> > @@ -2123,6 +2123,9 @@ static void __devinit quirk_disable_msi(
>> >        }
>> >  }
>> >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_8131_BRIDGE, quirk_disable_msi);
>> > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9602, quirk_disable_msi);
>> > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, 0x9602, quirk_disable_msi);
>> > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AI, 0x9602, quirk_disable_msi);
>> >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_VIA, 0xa238, quirk_disable_msi);
>> >
>> >  /* Go through the list of Hypertransport capabilities and
>> > @@ -2495,39 +2498,6 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AT
>> >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x4375,
>> >                        quirk_msi_intx_disable_bug);
>> >
>> > -/*
>> > - * MSI does not work with the AMD RS780/RS880 internal graphics and HDMI audio
>> > - * devices unless the BIOS has initialized the nb_cntl.strap_msi_enable bit.
>> > - */
>> > -static void __init rs780_int_gfx_disable_msi(struct pci_dev *int_gfx_bridge)
>> > -{
>> > -       u32 nb_cntl;
>> > -
>> > -       if (!int_gfx_bridge->subordinate)
>> > -               return;
>> > -
>> > -       pci_bus_write_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
>> > -                                  0x60, 0);
>> > -       pci_bus_read_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
>> > -                                 0x64, &nb_cntl);
>> > -
>> > -       if (!(nb_cntl & BIT(10))) {
>> > -               dev_warn(&int_gfx_bridge->dev,
>> > -                        FW_WARN "RS780: MSI for internal graphics disabled\n");
>> > -               int_gfx_bridge->subordinate->bus_flags |= PCI_BUS_FLAGS_NO_MSI;
>> > -       }
>> > -}
>> > -
>> > -#define PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX    0x9602
>> > -
>> > -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD,
>> > -                       PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX,
>> > -                       rs780_int_gfx_disable_msi);
>> > -/* wrong vendor ID on M4A785TD motherboard: */
>> > -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK,
>> > -                       PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX,
>> > -                       rs780_int_gfx_disable_msi);
>> > -
>> >  #endif /* CONFIG_PCI_MSI */
>> >
>> >  #ifdef CONFIG_PCI_IOV
>> > Index: linux-2.6/drivers/gpu/drm/radeon/radeon_irq_kms.c
>> > ===================================================================
>> > --- linux-2.6.orig/drivers/gpu/drm/radeon/radeon_irq_kms.c
>> > +++ linux-2.6/drivers/gpu/drm/radeon/radeon_irq_kms.c
>> > @@ -117,8 +117,7 @@ int radeon_irq_kms_init(struct radeon_de
>> >        /* MSIs don't seem to work reliably on all IGP
>> >         * chips.  Disable MSI on them for now.
>> >         */
>> > -       if ((rdev->family >= CHIP_RV380) &&
>> > -           (!(rdev->flags & RADEON_IS_IGP))) {
>> > +       if (rdev->family >= CHIP_RV380) {
>> >                int ret = pci_enable_msi(rdev->pdev);
>> >                if (!ret) {
>> >                        rdev->msi_enabled = 1;
>> >
>>
>> Let's skip this second chunk for now as there are other non-RS780 IGP
>> chips that could be problematic, so I'd rather just leave MSIs
>> disabled for now.
>
> Works for me.
>
> So do you want me to resubmit?
>

Please.

Thanks,

Alex

> Rafael
>

* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3)
  2010-04-01 16:29     ` Linus Torvalds
  2010-04-01 17:07         ` Alex Deucher
  2010-04-01 19:46       ` Rafael J. Wysocki
@ 2010-04-01 22:48       ` Jesse Barnes
  2010-04-01 23:23         ` Rafael J. Wysocki
  2 siblings, 1 reply; 242+ messages in thread
From: Jesse Barnes @ 2010-04-01 22:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rafael J. Wysocki, Linux Kernel Mailing List, Dave Airlie,
	dri-devel, Linux PCI, Clemens Ladisch, Alex Deucher, stable,
	Greg KH

On Thu, 1 Apr 2010 09:29:23 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Thu, 1 Apr 2010, Rafael J. Wysocki wrote:
> > 
> > OK, I've verified that partial revert (below) is sufficient.
> 
> Hmm. Through the DRM merge I just did, this area actually conflicted, and 
> the resolved version is now
> 
>         if ((rdev->family >= CHIP_RV380) &&
>             (!(rdev->flags & RADEON_IS_IGP))) {
> 
> which presumably also fixes your issue?
> 
> [ Side note: somebody in the DRM tree seems to be way too used to LISP, 
>   and thinks that adding parentheses always improves the code ;-]
> 
> However, I do suspect that we should probably revert the quirk regardless 
> as being useless (ie it probably was related to those IGP chips that 
> apparently don't do MSI anyway).
> 
> So the patch that reverts the quirk by Clemens (to replace it with 
> disabling MSI entirely when the AMD NB doesn't accept them) seems to be a 
> good idea regardless, since it's apparently not just about gfx. Jesse?

Yeah, that sounds fine.  I can include it in my next pull req or you
can just pick it up directly.

-- 
Jesse Barnes, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3)
  2010-04-01 22:07                           ` Alex Deucher
  (?)
@ 2010-04-01 23:20                           ` Rafael J. Wysocki
  2010-04-02  0:23                             ` Linus Torvalds
  -1 siblings, 1 reply; 242+ messages in thread
From: Rafael J. Wysocki @ 2010-04-01 23:20 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Clemens Ladisch, Linus Torvalds, Linux PCI, Greg KH,
	Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable,
	Dave Airlie

On Friday 02 April 2010, Alex Deucher wrote:
> On Thu, Apr 1, 2010 at 5:46 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > On Thursday 01 April 2010, Alex Deucher wrote:
> >> On Thu, Apr 1, 2010 at 5:08 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> >> > On Thursday 01 April 2010, Alex Deucher wrote:
> >> >> On Thu, Apr 1, 2010 at 4:48 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> >> >> > On Thursday 01 April 2010, Alex Deucher wrote:
> >> >> >> On Thu, Apr 1, 2010 at 4:28 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> >> >> >> > On Thursday 01 April 2010, Alex Deucher wrote:
> >> >> >> >> On Thu, Apr 1, 2010 at 2:36 AM, Clemens Ladisch <clemens@ladisch.de> wrote:
> >> >> >> >> > Alex Deucher wrote:
> >> >> >> >> >> On Wed, Mar 31, 2010 at 9:13 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> >> >> >> >> >>> On Tuesday 30 March 2010, Rafael J. Wysocki wrote:
...
> > So do you want me to resubmit?
> >
> 
> Please.

Appended, with sign-offs and changelog.

Thanks,
Rafael

---
Subject: PCI quirk: RS780/RS880: disable MSI completely

The missing initialization of the nb_cntl.strap_msi_enable does not
seem to be the only problem that prevents MSI, so that quirk is not
sufficient to enable MSI on all machines.  To be safe, disable MSI
unconditionally for the internal graphics and HDMI audio on these
chipsets.

[rjw: Added the PCI_VENDOR_ID_AI quirk.]

Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 drivers/pci/quirks.c |   36 +++---------------------------------
 1 file changed, 3 insertions(+), 33 deletions(-)

Index: linux-2.6/drivers/pci/quirks.c
===================================================================
--- linux-2.6.orig/drivers/pci/quirks.c
+++ linux-2.6/drivers/pci/quirks.c
@@ -2123,6 +2123,9 @@ static void __devinit quirk_disable_msi(
 	}
 }
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_8131_BRIDGE, quirk_disable_msi);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9602, quirk_disable_msi);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, 0x9602, quirk_disable_msi);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AI, 0x9602, quirk_disable_msi);
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_VIA, 0xa238, quirk_disable_msi);
 
 /* Go through the list of Hypertransport capabilities and
@@ -2495,39 +2498,6 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AT
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x4375,
 			quirk_msi_intx_disable_bug);
 
-/*
- * MSI does not work with the AMD RS780/RS880 internal graphics and HDMI audio
- * devices unless the BIOS has initialized the nb_cntl.strap_msi_enable bit.
- */
-static void __init rs780_int_gfx_disable_msi(struct pci_dev *int_gfx_bridge)
-{
-	u32 nb_cntl;
-
-	if (!int_gfx_bridge->subordinate)
-		return;
-
-	pci_bus_write_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
-				   0x60, 0);
-	pci_bus_read_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
-				  0x64, &nb_cntl);
-
-	if (!(nb_cntl & BIT(10))) {
-		dev_warn(&int_gfx_bridge->dev,
-			 FW_WARN "RS780: MSI for internal graphics disabled\n");
-		int_gfx_bridge->subordinate->bus_flags |= PCI_BUS_FLAGS_NO_MSI;
-	}
-}
-
-#define PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX	0x9602
-
-DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD,
-			PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX,
-			rs780_int_gfx_disable_msi);
-/* wrong vendor ID on M4A785TD motherboard: */
-DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK,
-			PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX,
-			rs780_int_gfx_disable_msi);
-
 #endif /* CONFIG_PCI_MSI */
 
 #ifdef CONFIG_PCI_IOV
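
[ Editorial illustration, not part of the patch. The sketch below models,
  in plain user-space C, what the three added DECLARE_PCI_FIXUP_FINAL lines
  amount to: a table keyed by vendor/device ID whose hook marks the bus
  behind the matching bridge as "no MSI", with no check of any strap bit.
  All struct and function names here are hypothetical stand-ins, and the
  numeric vendor IDs are assumptions recalled from pci_ids.h; only the
  0x9602 device ID comes from the patch itself. ]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Vendor IDs: assumed values, check include/linux/pci_ids.h. */
#define VENDOR_AMD      0x1022
#define VENDOR_ASUSTEK  0x1043
#define VENDOR_AI       0x1025
/* Device ID of the internal graphics bridge, taken from the patch. */
#define DEVICE_INT_GFX_BRIDGE 0x9602

/* Hypothetical stand-ins for struct pci_bus and struct pci_dev. */
struct fake_bus {
	bool no_msi;		/* models PCI_BUS_FLAGS_NO_MSI */
};

struct fake_bridge {
	uint16_t vendor;
	uint16_t device;
	struct fake_bus *subordinate;	/* bus behind the bridge, may be NULL */
};

struct fixup {
	uint16_t vendor;
	uint16_t device;
	void (*hook)(struct fake_bridge *);
};

/* Models quirk_disable_msi(): flag the secondary bus unconditionally. */
static void disable_msi_behind(struct fake_bridge *br)
{
	if (br->subordinate)
		br->subordinate->no_msi = true;
}

/* One entry per vendor ID under which the 0x9602 bridge may show up. */
static const struct fixup fixups[] = {
	{ VENDOR_AMD,     DEVICE_INT_GFX_BRIDGE, disable_msi_behind },
	{ VENDOR_ASUSTEK, DEVICE_INT_GFX_BRIDGE, disable_msi_behind },
	{ VENDOR_AI,      DEVICE_INT_GFX_BRIDGE, disable_msi_behind },
};

/* Run every hook whose vendor/device pair matches the bridge. */
static void apply_fixups(struct fake_bridge *br)
{
	assert(br != NULL);
	for (size_t i = 0; i < sizeof(fixups) / sizeof(fixups[0]); i++) {
		if (fixups[i].vendor == br->vendor &&
		    fixups[i].device == br->device)
			fixups[i].hook(br);
	}
}
```

[ Calling apply_fixups() on a bridge with device ID 0x9602 and any of the
  three vendor IDs sets no_msi on its subordinate bus; a bridge with a
  different vendor ID is left alone, which is why a board reporting an
  unexpected vendor ID needs its own table entry. ]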

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3)
  2010-04-01 22:48       ` Jesse Barnes
@ 2010-04-01 23:23         ` Rafael J. Wysocki
  0 siblings, 0 replies; 242+ messages in thread
From: Rafael J. Wysocki @ 2010-04-01 23:23 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: Linus Torvalds, Linux Kernel Mailing List, Dave Airlie,
	dri-devel, Linux PCI, Clemens Ladisch, Alex Deucher, stable,
	Greg KH

On Friday 02 April 2010, Jesse Barnes wrote:
> On Thu, 1 Apr 2010 09:29:23 -0700 (PDT)
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > 
> > 
> > On Thu, 1 Apr 2010, Rafael J. Wysocki wrote:
> > > 
> > > OK, I've verified that partial revert (below) is sufficient.
> > 
> > Hmm. Through the DRM merge I just did, this area actually conflicted, and 
> > the resolved version is now
> > 
> >         if ((rdev->family >= CHIP_RV380) &&
> >             (!(rdev->flags & RADEON_IS_IGP))) {
> > 
> > which presumably also fixes your issue?
> > 
> > [ Side note: somebody in the DRM tree seems to be way too used to LISP, 
> >   and thinks that adding parentheses always improves the code ;-]
> > 
> > However, I do suspect that we should probably revert the quirk regardless 
> > as being useless (ie it probably was related to those IGP chips that 
> > apparently don't do MSI anyway).
> > 
> > So the patch that reverts the quirk by Clemens (to replace it with 
> > disabling MSI entirely when the AMD NB doesn't accept them) seems to be a 
> > good idea regardless, since it's apparently not just about gfx. Jesse?
> 
> Yeah, that sounds fine.  I can include it in my next pull req or you
> can just pick it up directly.

Not exactly that one, please, it's missing a quirk for the affected system.

I've just sent a corrected version, here:
https://patchwork.kernel.org/patch/90275/

Rafael

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3)
  2010-04-01 23:20                           ` Rafael J. Wysocki
@ 2010-04-02  0:23                             ` Linus Torvalds
  2010-04-02 16:46                               ` Rafael J. Wysocki
  0 siblings, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-02  0:23 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Alex Deucher, Clemens Ladisch, Linux PCI, Greg KH,
	Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable,
	Dave Airlie



On Fri, 2 Apr 2010, Rafael J. Wysocki wrote:
> 
> Appended, with sign-offs and changelog.
> 
> ---
> Subject: PCI quirk: RS780/RS880: disable MSI completely

Hmm. Isn't this missing a

	From: Clemens Ladisch <clemens@ladisch.de>

too? Or was the original patch yours?

		Linus

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-02 18:09   ` Linus Torvalds
@ 2010-04-02 15:24     ` Andrew Morton
  2010-04-02 18:37       ` Linus Torvalds
  2010-04-06  8:53     ` Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) KOSAKI Motohiro
  1 sibling, 1 reply; 242+ messages in thread
From: Andrew Morton @ 2010-04-02 15:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Rik van Riel, Linux Kernel Mailing List,
	KOSAKI Motohiro, Lee Schermerhorn, Minchan Kim, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson

On Fri, 2 Apr 2010 11:09:14 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> I think this is likely due to the new scalable anon_vma linking by Rik. 

Similar to https://bugzilla.kernel.org/show_bug.cgi?id=15680

> Nothing else I can imagine should have introduced anything like it.
> 
> Rik: the pictures have the information, but you need to look at several to 
> see both the oops and the backtrace. Here's a condensed version:
> 
>  shrink_all_memory ->
>    do_try_to_free_pages ->
>      shrink_zone ->
>        shrink_inactive_list ->
>          shrink_page_list ->
>            page_referenced
> 
> where page_referenced() oopses due to page_referenced_anon() as per 
> Borislav's description below.
> 
> Added all the usual suspects to the Cc list. Left the full report appended 
> so that the new people don't have to search for it on lkml.
> 
> 		Linus
> 
> On Fri, 2 Apr 2010, Borislav Petkov wrote:
> > 
> > I've got the following oopsie two times now when hibernating - this
> > means, I don't get it every time I hibernate but only sometimes, say once
> > in a blue moon.
> > 
> > And yeah, I couldn't catch it over serial console so I had to make ugly
> > pictures. By the way, the numbers in the filenames increment as I scroll
> > down the whole oops (yep, it hadn't completely frozen and I still could
> > do Shift->PgUp or Shift->PgDn on the console):
> > 
> > http://www.kernel.org/pub/linux/kernel/people/bp/
> > 
> > So, here's what I could decipher from the oopsie, someone else who's
> > more knowledgeable in mm, rmap and anon_vma's list traversal should be
> > able to tell what goes wrong there.
> > 
> > EIP is at page_referenced+0xee
> > 
> > which is
> > 
> > <disasm>
> >     10c4:	41 01 c4             	add    %eax,%r12d
> >     10c7:	83 7d cc 00          	cmpl   $0x0,-0x34(%rbp)
> >     10cb:	74 19                	je     10e6 <page_referenced+0xff>
> >     10cd:	4d 8b 6d 20          	mov    0x20(%r13),%r13
> >     10d1:	49 83 ed 20          	sub    $0x20,%r13
> > 
> >     10d5:	49 8b 45 20          	mov    0x20(%r13),%rax		    <--------------
> > 
> >     10d9:	0f 18 08             	prefetcht0 (%rax)
> >     10dc:	49 8d 45 20          	lea    0x20(%r13),%rax
> >     10e0:	48 39 45 80          	cmp    %rax,-0x80(%rbp)
> > </disasm>
> > 
> > 
> > Corresponding asm:
> > 
> > <asm>
> > 	.loc 1 496 0
> > 	movq	32(%r13), %r13	# <variable>.same_anon_vma.next, __mptr.451
> > .LVL295:
> > 	subq	$32, %r13	#, avc
> > .LVL296:
> > .L184:
> > .LBE1278:
> > 	movq	32(%r13), %rax	# <variable>.same_anon_vma.next, <variable>.same_anon_vma.next			<----------------
> > 	prefetcht0	(%rax)	# <variable>.same_anon_vma.next
> > 	leaq	32(%r13), %rax	#, tmp97
> > 	cmpq	%rax, -128(%rbp)	# tmp97, %sfp
> > 	jne	.L187	#,
> > .L186:
> > 	.loc 1 514 0
> > 	movq	%r14, %rdi	# anon_vma,
> > 	call	page_unlock_anon_vma	#
> > </asm>
> > 
> > 
> > and the NULL pointer in question is being written into %r13 and then 32
> > is subtracted from it (I'm guessing container_of()). This is consistent
> > with the register snapshot - %r13 contains 0xffffffffffffffe0 which is
> > -32 and with the code dump in the oops, in CIMG1640.JPG code points to
> > opcode 49 8b 45 20.
> > 
> > Which is the following piece of code in <mm/rmap.c:page_referenced_anon()>.
> > 
> > <source>
> > 
> > 	mapcount = page_mapcount(page);
> > 	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
> > 		struct vm_area_struct *vma = avc->vma;
> > 		unsigned long address = vma_address(page, vma);
> > 		if (address == -EFAULT)
> > 			continue;
> > 
> > </source>
> > 
> > which tells us that same_anon_vma.next is NULL. Hmm...
> > 
> > -- 
> > Regards/Gruss,
> >     Boris.
> > 

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3)
  2010-04-02  0:23                             ` Linus Torvalds
@ 2010-04-02 16:46                               ` Rafael J. Wysocki
  2010-04-03 18:08                                 ` Clemens Ladisch
  0 siblings, 1 reply; 242+ messages in thread
From: Rafael J. Wysocki @ 2010-04-02 16:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alex Deucher, Clemens Ladisch, Linux PCI, Greg KH,
	Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable,
	Dave Airlie

On Friday 02 April 2010, Linus Torvalds wrote:
> 
> On Fri, 2 Apr 2010, Rafael J. Wysocki wrote:
> > 
> > Appended, with sign-offs and changelog.
> > 
> > ---
> > Subject: PCI quirk: RS780/RS880: disable MSI completely
> 
> Hmm. Isn't this missing a
> 
> 	From: Clemens Ladisch <clemens@ladisch.de>
> 
> too?

Ouch, yes it is, sorry.

This one should be complete.

---
From: Clemens Ladisch <clemens@ladisch.de>
Subject: PCI quirk: RS780/RS880: disable MSI completely

The missing initialization of the nb_cntl.strap_msi_enable does not
seem to be the only problem that prevents MSI, so that quirk is not
sufficient to enable MSI on all machines.  To be safe, disable MSI
unconditionally for the internal graphics and HDMI audio on these
chipsets.

[rjw: Added the PCI_VENDOR_ID_AI quirk.]

Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 drivers/pci/quirks.c |   36 +++---------------------------------
 1 file changed, 3 insertions(+), 33 deletions(-)

Index: linux-2.6/drivers/pci/quirks.c
===================================================================
--- linux-2.6.orig/drivers/pci/quirks.c
+++ linux-2.6/drivers/pci/quirks.c
@@ -2123,6 +2123,9 @@ static void __devinit quirk_disable_msi(
 	}
 }
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_8131_BRIDGE, quirk_disable_msi);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9602, quirk_disable_msi);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, 0x9602, quirk_disable_msi);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AI, 0x9602, quirk_disable_msi);
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_VIA, 0xa238, quirk_disable_msi);
 
 /* Go through the list of Hypertransport capabilities and
@@ -2495,39 +2498,6 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AT
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x4375,
 			quirk_msi_intx_disable_bug);
 
-/*
- * MSI does not work with the AMD RS780/RS880 internal graphics and HDMI audio
- * devices unless the BIOS has initialized the nb_cntl.strap_msi_enable bit.
- */
-static void __init rs780_int_gfx_disable_msi(struct pci_dev *int_gfx_bridge)
-{
-	u32 nb_cntl;
-
-	if (!int_gfx_bridge->subordinate)
-		return;
-
-	pci_bus_write_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
-				   0x60, 0);
-	pci_bus_read_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
-				  0x64, &nb_cntl);
-
-	if (!(nb_cntl & BIT(10))) {
-		dev_warn(&int_gfx_bridge->dev,
-			 FW_WARN "RS780: MSI for internal graphics disabled\n");
-		int_gfx_bridge->subordinate->bus_flags |= PCI_BUS_FLAGS_NO_MSI;
-	}
-}
-
-#define PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX	0x9602
-
-DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD,
-			PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX,
-			rs780_int_gfx_disable_msi);
-/* wrong vendor ID on M4A785TD motherboard: */
-DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK,
-			PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX,
-			rs780_int_gfx_disable_msi);
-
 #endif /* CONFIG_PCI_MSI */
 
 #ifdef CONFIG_PCI_IOV

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-03-30 17:50 Linux 2.6.34-rc3 Linus Torvalds
  2010-03-30 21:16 ` [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) Rafael J. Wysocki
@ 2010-04-02 17:59 ` Borislav Petkov
  2010-04-02 18:09   ` Linus Torvalds
  1 sibling, 1 reply; 242+ messages in thread
From: Borislav Petkov @ 2010-04-02 17:59 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton; +Cc: Linux Kernel Mailing List

Hi,

I've got the following oopsie two times now when hibernating - this
means, I don't get it every time I hibernate but only sometimes, say once
in a blue moon.

And yeah, I couldn't catch it over serial console so I had to make ugly
pictures. By the way, the numbers in the filenames increment as I scroll
down the whole oops (yep, it hadn't completely frozen and I still could
do Shift->PgUp or Shift->PgDn on the console):

http://www.kernel.org/pub/linux/kernel/people/bp/

So, here's what I could decipher from the oopsie, someone else who's
more knowledgeable in mm, rmap and anon_vma's list traversal should be
able to tell what goes wrong there.

EIP is at page_referenced+0xee

which is

<disasm>
    10c4:	41 01 c4             	add    %eax,%r12d
    10c7:	83 7d cc 00          	cmpl   $0x0,-0x34(%rbp)
    10cb:	74 19                	je     10e6 <page_referenced+0xff>
    10cd:	4d 8b 6d 20          	mov    0x20(%r13),%r13
    10d1:	49 83 ed 20          	sub    $0x20,%r13

    10d5:	49 8b 45 20          	mov    0x20(%r13),%rax		    <--------------

    10d9:	0f 18 08             	prefetcht0 (%rax)
    10dc:	49 8d 45 20          	lea    0x20(%r13),%rax
    10e0:	48 39 45 80          	cmp    %rax,-0x80(%rbp)
</disasm>


Corresponding asm:

<asm>
	.loc 1 496 0
	movq	32(%r13), %r13	# <variable>.same_anon_vma.next, __mptr.451
.LVL295:
	subq	$32, %r13	#, avc
.LVL296:
.L184:
.LBE1278:
	movq	32(%r13), %rax	# <variable>.same_anon_vma.next, <variable>.same_anon_vma.next			<----------------
	prefetcht0	(%rax)	# <variable>.same_anon_vma.next
	leaq	32(%r13), %rax	#, tmp97
	cmpq	%rax, -128(%rbp)	# tmp97, %sfp
	jne	.L187	#,
.L186:
	.loc 1 514 0
	movq	%r14, %rdi	# anon_vma,
	call	page_unlock_anon_vma	#
</asm>


and the NULL pointer in question is being written into %r13 and then 32
is subtracted from it (I'm guessing container_of()). This is consistent
with the register snapshot - %r13 contains 0xffffffffffffffe0 which is
-32 and with the code dump in the oops, in CIMG1640.JPG code points to
opcode 49 8b 45 20.

Which is the following piece of code in <mm/rmap.c:page_referenced_anon()>.

<source>

	mapcount = page_mapcount(page);
	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
		struct vm_area_struct *vma = avc->vma;
		unsigned long address = vma_address(page, vma);
		if (address == -EFAULT)
			continue;

</source>

which tells us that same_anon_vma.next is NULL. Hmm...
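
[ Editorial illustration of the arithmetic above, as a user-space sketch.
  It assumes an LP64 target and a struct whose list member sits at offset
  0x20, matching the 0x20 displacement in the disassembly; the struct name
  and both helper functions are hypothetical. The address math is done on
  uintptr_t so that the sketch never forms an out-of-bounds pointer. ]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/*
 * Hypothetical layout: two pointers and one list_head precede
 * same_anon_vma, which puts it at offset 0x20 on LP64. The real
 * struct anon_vma_chain may differ between kernel versions.
 */
struct avc_sketch {
	void *vma;
	void *anon_vma;
	struct list_head same_vma;
	struct list_head same_anon_vma;
};

static_assert(sizeof(void *) == 8, "sketch assumes an LP64 target");

/*
 * What container_of() computes for the next iteration: the address of
 * the enclosing entry, given the 'next' pointer read from the list.
 */
static uintptr_t entry_from_next(const struct list_head *next)
{
	return (uintptr_t)next - offsetof(struct avc_sketch, same_anon_vma);
}

/*
 * Address touched when the loop then reads entry->same_anon_vma.next,
 * i.e. the "mov 0x20(%r13),%rax" marked in the disassembly.
 */
static uintptr_t faulting_load_addr(const struct list_head *next)
{
	return entry_from_next(next)
	       + offsetof(struct avc_sketch, same_anon_vma)
	       + offsetof(struct list_head, next);
}
```

[ With next == NULL, entry_from_next() yields 0xffffffffffffffe0 (that is,
  -32, the %r13 value in the register dump) and faulting_load_addr() yields
  0, so the marked load is the NULL dereference. ]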

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-02 17:59 ` Ugly rmap NULL ptr deref oopsie on hibernate (was " Borislav Petkov
@ 2010-04-02 18:09   ` Linus Torvalds
  2010-04-02 15:24     ` Andrew Morton
  2010-04-06  8:53     ` Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) KOSAKI Motohiro
  0 siblings, 2 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-02 18:09 UTC (permalink / raw)
  To: Borislav Petkov, Rik van Riel
  Cc: Andrew Morton, Linux Kernel Mailing List, KOSAKI Motohiro,
	Lee Schermerhorn, Minchan Kim, Nick Piggin, Andrea Arcangeli,
	Hugh Dickins


I think this is likely due to the new scalable anon_vma linking by Rik. 
Nothing else I can imagine should have introduced anything like it.

Rik: the pictures have the information, but you need to look at several to 
see both the oops and the backtrace. Here's a condensed version:

 shrink_all_memory ->
   do_try_to_free_pages ->
     shrink_zone ->
       shrink_inactive_list ->
         shrink_page_list ->
           page_referenced

where page_referenced() oopses due to page_referenced_anon() as per 
Borislav's description below.

Added all the usual suspects to the Cc list. Left the full report appended 
so that the new people don't have to search for it on lkml.

		Linus

On Fri, 2 Apr 2010, Borislav Petkov wrote:
> 
> I've got the following oopsie two times now when hibernating - this
> means, I don't get it every time I hibernate but only sometimes, say once
> in a blue moon.
> 
> And yeah, I couldn't catch it over serial console so I had to make ugly
> pictures. By the way, the numbers in the filenames increment as I scroll
> down the whole oops (yep, it hadn't completely frozen and I still could
> do Shift->PgUp or Shift->PgDn on the console):
> 
> http://www.kernel.org/pub/linux/kernel/people/bp/
> 
> So, here's what I could decipher from the oopsie, someone else who's
> more knowledgeable in mm, rmap and anon_vma's list traversal should be
> able to tell what goes wrong there.
> 
> EIP is at page_referenced+0xee
> 
> which is
> 
> <disasm>
>     10c4:	41 01 c4             	add    %eax,%r12d
>     10c7:	83 7d cc 00          	cmpl   $0x0,-0x34(%rbp)
>     10cb:	74 19                	je     10e6 <page_referenced+0xff>
>     10cd:	4d 8b 6d 20          	mov    0x20(%r13),%r13
>     10d1:	49 83 ed 20          	sub    $0x20,%r13
> 
>     10d5:	49 8b 45 20          	mov    0x20(%r13),%rax		    <--------------
> 
>     10d9:	0f 18 08             	prefetcht0 (%rax)
>     10dc:	49 8d 45 20          	lea    0x20(%r13),%rax
>     10e0:	48 39 45 80          	cmp    %rax,-0x80(%rbp)
> </disasm>
> 
> 
> Corresponding asm:
> 
> <asm>
> 	.loc 1 496 0
> 	movq	32(%r13), %r13	# <variable>.same_anon_vma.next, __mptr.451
> .LVL295:
> 	subq	$32, %r13	#, avc
> .LVL296:
> .L184:
> .LBE1278:
> 	movq	32(%r13), %rax	# <variable>.same_anon_vma.next, <variable>.same_anon_vma.next			<----------------
> 	prefetcht0	(%rax)	# <variable>.same_anon_vma.next
> 	leaq	32(%r13), %rax	#, tmp97
> 	cmpq	%rax, -128(%rbp)	# tmp97, %sfp
> 	jne	.L187	#,
> .L186:
> 	.loc 1 514 0
> 	movq	%r14, %rdi	# anon_vma,
> 	call	page_unlock_anon_vma	#
> </asm>
> 
> 
> and the NULL pointer in question is being written into %r13 and then 32
> is subtracted from it (I'm guessing container_of()). This is consistent
> with the register snapshot - %r13 contains 0xffffffffffffffe0 which is
> -32 and with the code dump in the oops, in CIMG1640.JPG code points to
> opcode 49 8b 45 20.
> 
> Which is the following piece of code in <mm/rmap.c:page_referenced_anon()>.
> 
> <source>
> 
> 	mapcount = page_mapcount(page);
> 	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
> 		struct vm_area_struct *vma = avc->vma;
> 		unsigned long address = vma_address(page, vma);
> 		if (address == -EFAULT)
> 			continue;
> 
> </source>
> 
> which tells us that same_anon_vma.next is NULL. Hmm...
> 
> -- 
> Regards/Gruss,
>     Boris.
> 

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-02 15:24     ` Andrew Morton
@ 2010-04-02 18:37       ` Linus Torvalds
  2010-04-02 22:01         ` Rik van Riel
  0 siblings, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-02 18:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Borislav Petkov, Rik van Riel, Linux Kernel Mailing List,
	KOSAKI Motohiro, Lee Schermerhorn, Minchan Kim, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson



On Fri, 2 Apr 2010, Andrew Morton wrote:

> On Fri, 2 Apr 2010 11:09:14 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > 
> > I think this is likely due to the new scalable anon_vma linking by Rik. 
> 
> Similar to https://bugzilla.kernel.org/show_bug.cgi?id=15680

Yup, looks like the same thing, except that bugzilla entry was due to 
swapping rather than hibernation and memory shrinking. But same end 
result, just different reasons for why we were trying to shrink the page 
lists.

		Linus

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-02 18:37       ` Linus Torvalds
@ 2010-04-02 22:01         ` Rik van Riel
  2010-04-03  0:19           ` Linus Torvalds
  2010-04-04 16:12           ` Minchan Kim
  0 siblings, 2 replies; 242+ messages in thread
From: Rik van Riel @ 2010-04-02 22:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Borislav Petkov, Linux Kernel Mailing List,
	KOSAKI Motohiro, Lee Schermerhorn, Minchan Kim, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson

On 04/02/2010 02:37 PM, Linus Torvalds wrote:
> On Fri, 2 Apr 2010, Andrew Morton wrote:
>> On Fri, 2 Apr 2010 11:09:14 -0700 (PDT) Linus Torvalds<torvalds@linux-foundation.org>  wrote:
>>
>>>
>>> I think this is likely due to the new scalable anon_vma linking by Rik.
>>
>> Similar to https://bugzilla.kernel.org/show_bug.cgi?id=15680
>
> Yup, looks like the same thing, except that bugzilla entry was due to
> swapping rather than hibernation and memory shrinking. But same end
> result, just different reasons for why we were trying to shrink the page
> lists.

Interesting that it is a null pointer dereference, given
that we do not zero out the anon_vma_chain structs before
freeing them.

page_referenced_anon() takes the anon_vma->lock before
walking the list.  The three places where we modify the
anon_vma_chain->same_anon_vma list, we also hold the
lock.

No doubt something in mm/ is doing something silly, but
I have not found anything yet :(

If I had to guess, I'd say maybe we got one of the
mprotect & vma_adjust cases wrong.  Maybe a page stayed
around in the LRU (and in a process?) after its anon_vma
already got freed?

There has to be a reason why a very heavy AIM7 workload
and some other stress tests did not trigger it, but a few
people are able to trigger it on their systems...

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-02 22:01         ` Rik van Riel
@ 2010-04-03  0:19           ` Linus Torvalds
  2010-04-04 16:12           ` Minchan Kim
  1 sibling, 0 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-03  0:19 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Borislav Petkov, Linux Kernel Mailing List,
	KOSAKI Motohiro, Lee Schermerhorn, Minchan Kim, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson



On Fri, 2 Apr 2010, Rik van Riel wrote:
> 
> Interesting that it is a null pointer dereference, given
> that we do not zero out the anon_vma_chain structs before
> freeing them.
> 
> Page_referenced_anon() takes the anon_vma->lock before
> walking the list.  The three places where we modify the
> anon_vma_chain->same_anon_vma list, we also hold the
> lock.

So let's look at the individual anon_vma_chain entries instead.

What is the protection of the 'vma->anon_vma_chain' list? In 
anon_vma_prepare(), the code implies that it is the page_table_lock, but 
what about anon_vma_clone()? If I'm reading it correctly, it is some odd 
mix of "mmap_sem held for writing" or "mmap_sem held for reading _and_ 
page_table_lock". And then we have the exit case that apparently has no 
locking at all, but that should hopefully be single-threaded.

That thing is subtle. A few more comments about the locking would be good, 
so that people like me wouldn't have to try to guess the rules from 
reading the source.

> There has to be a reason why a very heavy AIM7 workload
> and some other stress tests did not trigger it, but a few
> people are able to trigger it on their systems...

I don't think AIM7 is at all a very interesting workload, and not likely 
to stress anything at all. Did your AIM7 test actually cause heavy 
swapping? I doubt it. 

Page swapout is where a lot of the magic happens, since that happens 
without mmap_sem held etc. 

			Linus

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3)
  2010-04-02 16:46                               ` Rafael J. Wysocki
@ 2010-04-03 18:08                                 ` Clemens Ladisch
  2010-04-03 19:33                                   ` Rafael J. Wysocki
  0 siblings, 1 reply; 242+ messages in thread
From: Clemens Ladisch @ 2010-04-03 18:08 UTC (permalink / raw)
  To: Rafael J. Wysocki, Linus Torvalds
  Cc: Alex Deucher, Linux PCI, Greg KH, Linux Kernel Mailing List,
	Jesse Barnes, dri-devel, stable, Dave Airlie

Rafael J. Wysocki wrote:
> From: Clemens Ladisch <clemens@ladisch.de>
> Subject: PCI quirk: RS780/RS880: disable MSI completely
> 
> The missing initialization of the nb_cntl.strap_msi_enable does not
> seem to be the only problem that prevents MSI, so that quirk is not
> sufficient to enable MSI on all machines.  To be safe, disable MSI
> unconditionally for the internal graphics and HDMI audio on these
> chipsets.
> 
> [rjw: Added the PCI_VENDOR_ID_AI quirk.]
> ...
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9602, quirk_disable_msi);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, 0x9602, quirk_disable_msi);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AI, 0x9602, quirk_disable_msi);

I fear I have to NACK this.  The fact that two OEMs have changed the vendor
ID makes it likely that this is a bug in AMD's template BIOS code, and that
we will see the same problem on other systems using other vendor IDs.

So we should not use the vendor ID of device 0x9602 to declare the quirk, but
use some other device with an ID that is known to be correct.  We already
access the configuration space of the host bridge, so we should use that.

Furthermore, the quirk in my first patch was never run at all on the ALi
system, so it is probable that the nb_cntl.strap_msi_enable detection
would actually work.  Rafael, please test this patch; if it doesn't work
on your system, we can still remove the check for the strap_msi_enable bit.
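
The matching idea can be tried outside the kernel. What follows is only a
minimal userspace sketch, not kernel code: struct fake_dev, find_slot() and
find_int_gfx_bridge() are hypothetical names, and the bus is modelled as a
plain array. It starts from a host bridge whose IDs are assumed to be
trustworthy, then looks at slot 1, function 0 and checks only the device ID,
so an OEM-modified vendor ID on the bridge does not matter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model of a PCI function; not struct pci_dev. */
struct fake_dev {
	uint16_t vendor;
	uint16_t device;
	uint8_t slot;
	uint8_t func;
};

#define VENDOR_AMD	0x1022
#define VENDOR_ASUSTEK	0x1043
#define DEV_RS780_P2P	0x9602	/* internal graphics PCI/PCI bridge */

/* Return the device at (slot, func) on the modelled bus, or NULL. */
static const struct fake_dev *find_slot(const struct fake_dev *bus, size_t n,
					uint8_t slot, uint8_t func)
{
	assert(bus != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (bus[i].slot == slot && bus[i].func == func)
			return &bus[i];
	return NULL;
}

/*
 * Key on the host bridge at slot 0, function 0 (device IDs 0x9600 and
 * 0x9601, as in the patch), then fetch the bridge at slot 1, function 0.
 * The bridge's vendor ID is deliberately ignored.
 */
static const struct fake_dev *find_int_gfx_bridge(const struct fake_dev *bus,
						  size_t n)
{
	const struct fake_dev *host = find_slot(bus, n, 0, 0);
	const struct fake_dev *bridge;

	if (!host || host->vendor != VENDOR_AMD ||
	    (host->device != 0x9600 && host->device != 0x9601))
		return NULL;

	bridge = find_slot(bus, n, 1, 0);
	if (!bridge || bridge->device != DEV_RS780_P2P)
		return NULL;
	return bridge;
}
```

With a bus holding an AMD 0x9600 host bridge and a 0x9602 bridge that reports
the ASUSTeK vendor ID, find_int_gfx_bridge() still returns the bridge; if the
host bridge ID is not one of the two expected ones, it returns NULL.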

==========

Subject: PCI quirk: RS780/RS880: work around wrong vendor IDs of RS780 bridge

On many RS780 systems, the vendor ID of the PCI/PCI bridge for the
internal graphics is set to that of the mainboard vendor, so the quirk
would not match and failed to notice the disabled MSI.

Since we do not know in advance all possible vendor IDs, we have to
declare the quirk on another device with an ID that is known to be
correct, and use that as a stepping stone to find the PCI/PCI bridge,
if present.

Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
Cc: <stable@kernel.org>

--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -2483,34 +2483,38 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AT
  * MSI does not work with the AMD RS780/RS880 internal graphics and HDMI audio
  * devices unless the BIOS has initialized the nb_cntl.strap_msi_enable bit.
  */
-static void __init rs780_int_gfx_disable_msi(struct pci_dev *int_gfx_bridge)
+static void __init rs780_int_gfx_disable_msi(struct pci_dev *host_bridge)
 {
+	struct pci_dev *int_gfx_bridge;
 	u32 nb_cntl;
 
-	if (!int_gfx_bridge->subordinate)
+	/*
+	 * Many OEMs change the vendor ID of the internal graphics PCI/PCI
+	 * bridge, so we use the possible vendor/device IDs of the host bridge
+	 * for the declared quirk, and search for the PCI/PCI bridge by slot
+	 * number.
+	 */
+	int_gfx_bridge = pci_get_slot(host_bridge->bus, PCI_DEVFN(1, 0));
+	if (!int_gfx_bridge)
 		return;
+	if (int_gfx_bridge->device != 0x9602 || !int_gfx_bridge->subordinate)
+		goto out;
 
-	pci_bus_write_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
-				   0x60, 0);
-	pci_bus_read_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
-				  0x64, &nb_cntl);
+	pci_write_config_dword(host_bridge, 0x60, 0);
+	pci_read_config_dword(host_bridge, 0x64, &nb_cntl);
 
 	if (!(nb_cntl & BIT(10))) {
 		dev_warn(&int_gfx_bridge->dev,
 			 FW_WARN "RS780: MSI for internal graphics disabled\n");
 		int_gfx_bridge->subordinate->bus_flags |= PCI_BUS_FLAGS_NO_MSI;
 	}
-}
 
-#define PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX	0x9602
+out:
+	pci_dev_put(int_gfx_bridge);
+}
 
-DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD,
-			PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX,
-			rs780_int_gfx_disable_msi);
-/* wrong vendor ID on M4A785TD motherboard: */
-DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK,
-			PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX,
-			rs780_int_gfx_disable_msi);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9600, rs780_int_gfx_disable_msi);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9601, rs780_int_gfx_disable_msi);
 
 #endif /* CONFIG_PCI_MSI */
 

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3)
  2010-04-03 18:08                                 ` Clemens Ladisch
@ 2010-04-03 19:33                                   ` Rafael J. Wysocki
  0 siblings, 0 replies; 242+ messages in thread
From: Rafael J. Wysocki @ 2010-04-03 19:33 UTC (permalink / raw)
  To: Clemens Ladisch
  Cc: Linus Torvalds, Alex Deucher, Linux PCI, Greg KH,
	Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable,
	Dave Airlie

On Saturday 03 April 2010, Clemens Ladisch wrote:
> Rafael J. Wysocki wrote:
> > From: Clemens Ladisch <clemens@ladisch.de>
> > Subject: PCI quirk: RS780/RS880: disable MSI completely
> > 
> > The missing initialization of the nb_cntl.strap_msi_enable does not
> > seem to be the only problem that prevents MSI, so that quirk is not
> > sufficient to enable MSI on all machines.  To be safe, disable MSI
> > unconditionally for the internal graphics and HDMI audio on these
> > chipsets.
> > 
> > [rjw: Added the PCI_VENDOR_ID_AI quirk.]
> > ...
> > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9602, quirk_disable_msi);
> > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, 0x9602, quirk_disable_msi);
> > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AI, 0x9602, quirk_disable_msi);
> 
> I fear I have to NACK this.

I'm afraid it's too late, the patch has been merged.

> The fact that two OEMs have changed the vendor
> ID makes it likely that this is a bug in AMD's template BIOS code, and that
> we will see the same problem on other systems using other vendor IDs.
> 
> So we should not use the vendor ID of device 0x9602 to declare the quirk, but
> use some other device with an ID that is known to be correct.  We already
> access the configuration space of the host bridge, so we should use that.
> 
> Furthermore, the quirk in my first patch was never run at all on the ALi
> system, so it is probable that the nb_cntl.strap_msi_enable detection
> would actually work.  Rafael, please test this patch; if it doesn't work
> on your system, we can still remove the check for the strap_msi_enable bit.
> 
> ==========
> 
> Subject: PCI quirk: RS780/RS880: work around wrong vendor IDs of RS780 bridge
> 
> On many RS780 systems, the vendor ID of the PCI/PCI bridge for the
> internal graphics is set to that of the mainboard vendor, so the quirk
> would not match and failed to notice the disabled MSI.
> 
> Since we do not know in advance all possible vendor IDs, we have to
> declare the quirk on another device with an ID that is known to be
> correct, and use that as a stepping stone to find the PCI/PCI bridge,
> if present.
> 
> Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
> Cc: <stable@kernel.org>

Yes, this works (after reverting commit
5193d7a7f500cfbbfc0de221e808208199723521 and removing the
(rdev->flags & RADEON_IS_IGP) test from radeon_irq_kms_init()).

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-02 22:01         ` Rik van Riel
  2010-04-03  0:19           ` Linus Torvalds
@ 2010-04-04 16:12           ` Minchan Kim
  2010-04-04 17:24             ` Rik van Riel
  2010-04-04 23:09             ` [PATCH] rmap: fix anon_vma_fork() memory leak Rik van Riel
  1 sibling, 2 replies; 242+ messages in thread
From: Minchan Kim @ 2010-04-04 16:12 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Andrew Morton, Borislav Petkov,
	Linux Kernel Mailing List, KOSAKI Motohiro, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

Hi, Rik. 

On Fri, 2010-04-02 at 18:01 -0400, Rik van Riel wrote:
> On 04/02/2010 02:37 PM, Linus Torvalds wrote:
> > On Fri, 2 Apr 2010, Andrew Morton wrote:
> >> On Fri, 2 Apr 2010 11:09:14 -0700 (PDT) Linus Torvalds<torvalds@linux-foundation.org>  wrote:
> >>
> >>>
> >>> I think this is likely due to the new scalable anon_vma linking by Rik.
> >>
> >> Similar to https://bugzilla.kernel.org/show_bug.cgi?id=15680
> >
> > Yup, looks like the same thing, except that bugzilla entry was due to
> > swapping rather than hibernation and memory shrinking. But same end
> > result, just different reasons for why we were trying to shrink the page
> > lists.
> 
> Interesting that it is a null pointer dereference, given
> that we do not zero out the anon_vma_chain structs before
> freeing them.
> 
> Page_referenced_anon() takes the anon_vma->lock before
> walking the list.  The three places where we modify the
> anon_vma_chain->same_anon_vma list, we also hold the
> lock.
> 
> No doubt something in mm/ is doing something silly, but
> I have not found anything yet :(
> 
> If I had to guess, I'd say maybe we got one of the
> mprotect & vma_adjust cases wrong.  Maybe a page stayed
> around in the LRU (and in a process?) after its anon_vma
> already got freed?

While reviewing the code again because of this BUG, I found something
strange.

In anon_vma_fork(), what happens if anon_vma_clone() succeeds but
anon_vma_alloc() fails? The parent VMA's anon_vmas are left with
anon_vma_chain entries that point to a vma which has been destroyed.
I couldn't find any cleanup routine that removes this garbage.
Am I missing something?

But I think it isn't related to this bug, because the oops point is not
vma_address() but anon_vma_chain.next.



-- 
Kind regards,
Minchan Kim



^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-04 16:12           ` Minchan Kim
@ 2010-04-04 17:24             ` Rik van Riel
  2010-04-04 23:09             ` [PATCH] rmap: fix anon_vma_fork() memory leak Rik van Riel
  1 sibling, 0 replies; 242+ messages in thread
From: Rik van Riel @ 2010-04-04 17:24 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Linus Torvalds, Andrew Morton, Borislav Petkov,
	Linux Kernel Mailing List, KOSAKI Motohiro, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On 04/04/2010 12:12 PM, Minchan Kim wrote:

> While I review the code again due to this BUG, I found some strange
> thing.
>
> In anon_vma_fork, if anon_vma_clone is successful but anon_vma_alloc is
> failed, what happens? Parent VMA's anon_vmas have anon_vma_chain which
> has vma which is destroyed.
> I couldn't find any clean routine to remove this garbage.
> I am missing something?

Good catch.  The parent VMA's anon_vmas will get delinked
eventually, but we need to get rid of the newly allocated
child anon_vmas.  You found a hopefully rare memory leak...

We need a call to unlink_anon_vmas(vma) at the error label
to do that.

> But I think it isn't related to this bug because oops point is not
> vma_address but anon_vma_chain.next.

Agreed, it's probably not it.

^ permalink raw reply	[flat|nested] 242+ messages in thread

* [PATCH] rmap: fix anon_vma_fork() memory leak
  2010-04-04 16:12           ` Minchan Kim
  2010-04-04 17:24             ` Rik van Riel
@ 2010-04-04 23:09             ` Rik van Riel
  2010-04-04 23:56               ` Minchan Kim
  2010-04-05 15:37               ` Linus Torvalds
  1 sibling, 2 replies; 242+ messages in thread
From: Rik van Riel @ 2010-04-04 23:09 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Linus Torvalds, Andrew Morton, Borislav Petkov,
	Linux Kernel Mailing List, KOSAKI Motohiro, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

Fix a memory leak in anon_vma_fork(), where we fail to tear down the
anon_vmas attached to the new VMA in case setting up the new anon_vma
fails.

Reported-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Rik van Riel <riel@redhat.com>

diff --git a/mm/rmap.c b/mm/rmap.c
index fcd593c..fb7ce99 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -231,6 +231,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 
  out_error_free_anon_vma:
 	anon_vma_free(anon_vma);
+	unlink_anon_vmas(vma);
  out_error:
 	return -ENOMEM;
 }


^ permalink raw reply related	[flat|nested] 242+ messages in thread

* Re: [PATCH] rmap: fix anon_vma_fork() memory leak
  2010-04-04 23:09             ` [PATCH] rmap: fix anon_vma_fork() memory leak Rik van Riel
@ 2010-04-04 23:56               ` Minchan Kim
  2010-04-05 15:37               ` Linus Torvalds
  1 sibling, 0 replies; 242+ messages in thread
From: Minchan Kim @ 2010-04-04 23:56 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Andrew Morton, Borislav Petkov,
	Linux Kernel Mailing List, KOSAKI Motohiro, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On Mon, Apr 5, 2010 at 8:09 AM, Rik van Riel <riel@redhat.com> wrote:
> Fix a memory leak in anon_vma_fork(), where we fail to tear down the
> anon_vmas attached to the new VMA in case setting up the new anon_vma
> fails.
>
> Reported-by: Minchan Kim <minchan.kim@gmail.com>
> Signed-off-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH] rmap: fix anon_vma_fork() memory leak
  2010-04-04 23:09             ` [PATCH] rmap: fix anon_vma_fork() memory leak Rik van Riel
  2010-04-04 23:56               ` Minchan Kim
@ 2010-04-05 15:37               ` Linus Torvalds
  2010-04-05 15:48                 ` Minchan Kim
                                   ` (2 more replies)
  1 sibling, 3 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-05 15:37 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Minchan Kim, Andrew Morton, Borislav Petkov,
	Linux Kernel Mailing List, KOSAKI Motohiro, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson



On Sun, 4 Apr 2010, Rik van Riel wrote:
>
> Fix a memory leak in anon_vma_fork(), where we fail to tear down the
> anon_vmas attached to the new VMA in case setting up the new anon_vma
> fails.
> 
> Reported-by: Minchan Kim <minchan.kim@gmail.com>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> ---
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index fcd593c..fb7ce99 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -231,6 +231,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>  
>   out_error_free_anon_vma:
>  	anon_vma_free(anon_vma);
> +	unlink_anon_vmas(vma);
>   out_error:
>  	return -ENOMEM;
>  }

This looks _very_ wrong to me.

Shouldn't the unlink_anon_vmas() be in the "out_error" case? IOW, we 
should do it even if the "anon_vma_alloc()" failed, not just if the 
"anon_vma_chain_alloc()" failed?

No?

What am I missing?
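
To make the question concrete, here is a minimal userspace sketch of the
error path. It is a model under stated assumptions, not the kernel code:
struct fake_vma, clone_chains(), unlink_chains(), fork_model() and the
fail_step knob are all hypothetical, and malloc() stands in for the slab
allocations. The point it illustrates is that the cleanup for the first
step has to sit under the outermost label, so that every later failure
passes through it.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; none of these are the kernel's types. */
struct chain { struct chain *next; };
struct fake_vma {
	struct chain *chains;	/* entries attached by the clone step */
	void *anon_vma;
	void *avc;
};

static int live_chains;	/* chain entries currently allocated */
static int fail_step;	/* test knob: 1 = anon_vma alloc fails, 2 = avc alloc fails */

/* Step 1, analogous to anon_vma_clone(): attach two chain entries. */
static int clone_chains(struct fake_vma *vma)
{
	for (int i = 0; i < 2; i++) {
		struct chain *c = malloc(sizeof(*c));

		if (!c)
			return -1;
		c->next = vma->chains;
		vma->chains = c;
		live_chains++;
	}
	return 0;
}

/* Analogous to unlink_anon_vmas(): drop everything attached to the vma. */
static void unlink_chains(struct fake_vma *vma)
{
	assert(vma != NULL);
	while (vma->chains) {
		struct chain *c = vma->chains;

		vma->chains = c->next;
		free(c);
		live_chains--;
	}
}

static int fork_model(struct fake_vma *vma)
{
	void *anon_vma, *avc;

	if (clone_chains(vma))
		goto out_error;

	anon_vma = (fail_step == 1) ? NULL : malloc(16);
	if (!anon_vma)
		goto out_error;

	avc = (fail_step == 2) ? NULL : malloc(16);
	if (!avc)
		goto out_error_free_anon_vma;

	vma->anon_vma = anon_vma;
	vma->avc = avc;
	return 0;

out_error_free_anon_vma:
	free(anon_vma);
out_error:
	/* Reached by both failures, so the cloned entries never leak. */
	unlink_chains(vma);
	return -1;
}
```

If unlink_chains() were moved up under out_error_free_anon_vma instead, the
fail_step == 1 case would return with two chain entries still attached.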

			Linus

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH] rmap: fix anon_vma_fork() memory leak
  2010-04-05 15:37               ` Linus Torvalds
@ 2010-04-05 15:48                 ` Minchan Kim
  2010-04-05 16:04                 ` Rik van Riel
  2010-04-05 16:13                 ` [PATCH -v2] " Rik van Riel
  2 siblings, 0 replies; 242+ messages in thread
From: Minchan Kim @ 2010-04-05 15:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Andrew Morton, Borislav Petkov,
	Linux Kernel Mailing List, KOSAKI Motohiro, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On Tue, Apr 6, 2010 at 12:37 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>
> On Sun, 4 Apr 2010, Rik van Riel wrote:
>>
>> Fix a memory leak in anon_vma_fork(), where we fail to tear down the
>> anon_vmas attached to the new VMA in case setting up the new anon_vma
>> fails.
>>
>> Reported-by: Minchan Kim <minchan.kim@gmail.com>
>> Signed-off-by: Rik van Riel <riel@redhat.com>
>> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
>> ---
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index fcd593c..fb7ce99 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -231,6 +231,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>>
>>   out_error_free_anon_vma:
>>       anon_vma_free(anon_vma);
>> +     unlink_anon_vmas(vma);
>>   out_error:
>>       return -ENOMEM;
>>  }
>
> This looks _very_ wrong to me.
>
> Shouldn't the unlink_anon_vmas() be in the "out_error" case? IOW, we
> should do it even if the "anon_vma_alloc()" failed, not just if the
> "anon_vma_chain_alloc()" failed?
>
> No?
>
> What am I missing?

Indeed. You're right.
I should have reviewed it more carefully.



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH] rmap: fix anon_vma_fork() memory leak
  2010-04-05 15:37               ` Linus Torvalds
  2010-04-05 15:48                 ` Minchan Kim
@ 2010-04-05 16:04                 ` Rik van Riel
  2010-04-05 16:13                 ` [PATCH -v2] " Rik van Riel
  2 siblings, 0 replies; 242+ messages in thread
From: Rik van Riel @ 2010-04-05 16:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Minchan Kim, Andrew Morton, Borislav Petkov,
	Linux Kernel Mailing List, KOSAKI Motohiro, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On 04/05/2010 11:37 AM, Linus Torvalds wrote:

> This looks _very_ wrong to me.
>
> Shouldn't the unlink_anon_vmas() be in the "out_error" case?

Indeed it should.  I've had my mind somewhere else this weekend :/

New patch in the next mail.

^ permalink raw reply	[flat|nested] 242+ messages in thread

* [PATCH -v2] rmap: fix anon_vma_fork() memory leak
  2010-04-05 15:37               ` Linus Torvalds
  2010-04-05 15:48                 ` Minchan Kim
  2010-04-05 16:04                 ` Rik van Riel
@ 2010-04-05 16:13                 ` Rik van Riel
  2 siblings, 0 replies; 242+ messages in thread
From: Rik van Riel @ 2010-04-05 16:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Minchan Kim, Andrew Morton, Borislav Petkov,
	Linux Kernel Mailing List, KOSAKI Motohiro, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

Fix a memory leak in anon_vma_fork(), where we fail to tear down the
anon_vmas attached to the new VMA in case setting up the new anon_vma
fails.

This bug also has the potential to leave behind anon_vma_chain structs
with pointers to invalid memory.

Reported-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Rik van Riel <riel@redhat.com>

diff --git a/mm/rmap.c b/mm/rmap.c
index fcd593c..eaa7a09 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -232,6 +232,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
  out_error_free_anon_vma:
 	anon_vma_free(anon_vma);
  out_error:
+	unlink_anon_vmas(vma);
 	return -ENOMEM;
 }
 

^ permalink raw reply related	[flat|nested] 242+ messages in thread

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-02 18:09   ` Linus Torvalds
  2010-04-02 15:24     ` Andrew Morton
@ 2010-04-06  8:53     ` KOSAKI Motohiro
  2010-04-06 10:09       ` KOSAKI Motohiro
  2010-04-06 14:38       ` Rik van Riel
  1 sibling, 2 replies; 242+ messages in thread
From: KOSAKI Motohiro @ 2010-04-06  8:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kosaki.motohiro, Borislav Petkov, Rik van Riel, Andrew Morton,
	Linux Kernel Mailing List, Lee Schermerhorn, Minchan Kim,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins

> 
> I think this is likely due to the new scalable anon_vma linking by Rik. 
> Nothing else I can imagine should have introduced anything like it.
> 
> Rik: the pictures have the information, but you need to look at several to 
> see both the oops and the backtrace. Here's a condensed version:
> 
>  shrink_all_memory ->
>    do_try_to_free_pages ->
>      shrink_zone ->
>        shrink_inactive_list ->
>          shrink_page_list ->
>            page_referenced
> 
> where page_referenced() oopses due to page_referenced_anon() as per 
> Borislav's description below.
> 
> Added all the usual suspects to the Cc list. Left the full report appended 
> so that the new people don't have to search for it on lkml.

Today I reviewed this patch carefully, but I haven't found any bug.

1) anon_vma->list is always protected by anon_vma->lock.
2) Even if someone forgot to take the lock, list_add() and list_del()
   never assign NULL.

So a NULL pointer means one of three possibilities:

 a) we are seeing uninitialized data
 b) we are seeing data after it was freed
 c) we are seeing memory corruption caused by another bug

But (a) can't happen, because

	static inline void __list_add(struct list_head *new,
				      struct list_head *prev,
				      struct list_head *next)
	{
	        next->prev = new;
	        new->next = next;
	        new->prev = prev;
	        prev->next = new;  /* (*) */
	}

Even if an uninitialized entry were linked into the avc list, new->next
would already have been set to a non-NULL value by the time of (*).
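
The same argument can be checked with a small userspace sketch. This is an
illustration only, not the kernel's list.h: list_add_between() is a renamed
copy of the function above (names with a leading double underscore are
reserved in userspace C), and list_has_no_null() is a hypothetical helper
added for the check. Entries start out with NULL pointers to mimic
"uninitialized" data.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace version of a circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Same four stores as __list_add() above. */
static void list_add_between(struct list_head *new, struct list_head *prev,
			     struct list_head *next)
{
	assert(new && prev && next);
	next->prev = new;
	new->next = next;
	new->prev = prev;
	prev->next = new;
}

/* Insert 'new' right after 'head'. */
static void list_add(struct list_head *new, struct list_head *head)
{
	list_add_between(new, head, head->next);
}

/* Walk the ring once; return 1 if no next/prev pointer is NULL. */
static int list_has_no_null(const struct list_head *head)
{
	const struct list_head *p = head;

	do {
		if (!p->next || !p->prev)
			return 0;
		p = p->next;
	} while (p != head);
	return 1;
}
```

Every store in list_add_between() writes the address of an existing node, so
whatever the new entry contained before, both of its pointers are non-NULL
once it is on the list.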

(b) is also impossible. SLAB_DESTROY_BY_RCU delays freeing of the page
that holds the anon_vma until the next RCU grace period. That means
rcu_read_lock()+page_mapped() can see a kfree()ed page, but that is
safe: no one corrupts it.

Now I suspect (c) ;-)



Also, I ran a stress workload with shrink_all_memory() today, but
I couldn't reproduce the issue. Hmm... (perhaps I'm just not lucky;
I frequently fail to reproduce bugs)

I'll continue to work on it.



^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06  8:53     ` Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) KOSAKI Motohiro
@ 2010-04-06 10:09       ` KOSAKI Motohiro
  2010-04-06 14:34         ` Rik van Riel
  2010-04-06 14:38       ` Rik van Riel
  1 sibling, 1 reply; 242+ messages in thread
From: KOSAKI Motohiro @ 2010-04-06 10:09 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, Linus Torvalds, Borislav Petkov, Rik van Riel,
	Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn,
	Minchan Kim, Nick Piggin, Andrea Arcangeli, Hugh Dickins

> (b) is also impossible. SLAB_DESTROY_BY_RCU delay the page for anon_vma 
> freeing until next rcu period. It mean rcu_read_lock()+page_mapped() 
> can see kfree()ed page. but it is safe. noone corrupt it.

By the way: I don't understand why Rik's per-process anon_vma concept
works correctly with ksm. ksm increments anon_vma->ksm_refcount, but it
does not seem guaranteed that vma->anon_vma and page->anon_vma are the same.

But I guess the bug reporter doesn't use ksm; it's a minor feature.




^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 10:09       ` KOSAKI Motohiro
@ 2010-04-06 14:34         ` Rik van Riel
  0 siblings, 0 replies; 242+ messages in thread
From: Rik van Riel @ 2010-04-06 14:34 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linus Torvalds, Borislav Petkov, Andrew Morton,
	Linux Kernel Mailing List, Lee Schermerhorn, Minchan Kim,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins

On 04/06/2010 06:09 AM, KOSAKI Motohiro wrote:
>> (b) is also impossible. SLAB_DESTROY_BY_RCU delay the page for anon_vma
>> freeing until next rcu period. It mean rcu_read_lock()+page_mapped()
>> can see kfree()ed page. but it is safe. noone corrupt it.
>
> by the way: I haven't understand why rik's per process anon_vma concept
> works correctly with ksm. ksm increase anon_vma->ksm_refcount. but it seems
> not guranteed vma->anon_vma and page->anon_vma are the same.

KSM removes the page from its original anon_vma.

If the page gets reinstantiated (copy on write), it will be
created in the vma->anon_vma.

Am I overlooking something?

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06  8:53     ` Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) KOSAKI Motohiro
  2010-04-06 10:09       ` KOSAKI Motohiro
@ 2010-04-06 14:38       ` Rik van Riel
  2010-04-06 15:34         ` Minchan Kim
  2010-04-06 17:05         ` Borislav Petkov
  1 sibling, 2 replies; 242+ messages in thread
From: Rik van Riel @ 2010-04-06 14:38 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linus Torvalds, Borislav Petkov, Andrew Morton,
	Linux Kernel Mailing List, Lee Schermerhorn, Minchan Kim,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins

On 04/06/2010 04:53 AM, KOSAKI Motohiro wrote:

> Today, I've reviewed this patch carefully. but I haven't found any bug.

> Also, I've runned stress workload with shrink_all_memory() today. but
> I couldn't reproduce the issue. hmm..  (perhaps I'm no lucky guy.
> I'm frequently fail to reproduce)
>
> I'll continue to work.

My status with this bug is the same - I have gone through
the code from all angles, but have not found any other bugs
yet (except for that leak - which could leave invalid
pointers behind).

This makes me wonder if perhaps the bug is a side effect
of something Borislav (and the other reproducers) have
in their kernel configuration, which we do not have.

Another (unlikely) thing is that the fix for the leak
makes the bug go away.  Yes, very unlikely.

Borislav, could you please send us your .config ?

Also, if you have the time, could you try out the
patch (-v2) I mailed in a little up this thread
that fixes the memory leak in anon_vma_fork?

I suspect it should not change anything, but it
could be useful to rule out anyway.

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 14:38       ` Rik van Riel
@ 2010-04-06 15:34         ` Minchan Kim
  2010-04-06 15:40           ` Rik van Riel
  2010-04-06 15:55           ` Linus Torvalds
  2010-04-06 17:05         ` Borislav Petkov
  1 sibling, 2 replies; 242+ messages in thread
From: Minchan Kim @ 2010-04-06 15:34 UTC (permalink / raw)
  To: Rik van Riel
  Cc: KOSAKI Motohiro, Linus Torvalds, Borislav Petkov, Andrew Morton,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins

On Tue, 2010-04-06 at 10:38 -0400, Rik van Riel wrote:
> On 04/06/2010 04:53 AM, KOSAKI Motohiro wrote:
> 
> > Today, I've reviewed this patch carefully. but I haven't found any bug.
> 
> > Also, I've runned stress workload with shrink_all_memory() today. but
> > I couldn't reproduce the issue. hmm..  (perhaps I'm no lucky guy.
> > I'm frequently fail to reproduce)
> >
> > I'll continue to work.
> 
> My status with this bug is the same - I have gone through
> the code from all angles, but have not found any other bugs
> yet (except for that leak - which could leave invalid
> pointers behind).

Let's see the unlink_anon_vmas. 

1. list_for_each_entry_safe(avc,next, vma->anon_vma_chain, same_vma)
2. 	anon_vma_unlink
3. 		spin_lock(anon_vma->lock) <-- HERE LOCK.
4.		list_del(anon_vma_chain->same_anon_vma);

What if the anon_vma is destroyed and reused by SLAB_XXX_RCU for another
anon_vma object between 2 and 3?
I mean, how do we make sure that 3) locks a valid anon_vma?

I hope it is the culprit.


-- 
Kind regards,
Minchan Kim



^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 15:34         ` Minchan Kim
@ 2010-04-06 15:40           ` Rik van Riel
  2010-04-06 15:58             ` Minchan Kim
  2010-04-06 15:55           ` Linus Torvalds
  1 sibling, 1 reply; 242+ messages in thread
From: Rik van Riel @ 2010-04-06 15:40 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, Linus Torvalds, Borislav Petkov, Andrew Morton,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins

On 04/06/2010 11:34 AM, Minchan Kim wrote:

> Let's see the unlink_anon_vmas.
>
> 1. list_for_each_entry_safe(avc,next, vma->anon_vma_chain, same_vma)
> 2. 	anon_vma_unlink
> 3. 		spin_lock(anon_vma->lock)<-- HERE LOCK.
> 4.		list_del(anon_vma_chain->same_anon_vma);
>
> What if anon_vma is destroyed and reuse by SLAB_XXX_RCU for another
> anon_vma object between 2 and 3?
> I mean how to make sure 3) does lock valid anon_vma?
>
> I hope it is culprit.

How can the anon_vma get destroyed and reused, when this
anon_vma_chain still has a reference to it (and the
anon_vma has not been freed yet)?

What combination of circumstances is necessary for
your hypothetical bug to happen?

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 15:34         ` Minchan Kim
  2010-04-06 15:40           ` Rik van Riel
@ 2010-04-06 15:55           ` Linus Torvalds
  2010-04-06 16:23             ` Minchan Kim
  2010-04-07  8:37             ` Peter Zijlstra
  1 sibling, 2 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-06 15:55 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, KOSAKI Motohiro, Borislav Petkov, Andrew Morton,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins



On Wed, 7 Apr 2010, Minchan Kim wrote:
> 
> Let's see the unlink_anon_vmas. 
> 
> 1. list_for_each_entry_safe(avc,next, vma->anon_vma_chain, same_vma)
> 2. 	anon_vma_unlink
> 3. 		spin_lock(anon_vma->lock) <-- HERE LOCK.
> 4.		list_del(anon_vma_chain->same_anon_vma);
> 
> What if anon_vma is destroyed and reuse by SLAB_XXX_RCU for another
> anon_vma object between 2 and 3?
> I mean how to make sure 3) does lock valid anon_vma? 
> 
> I hope it is culprit.

I don't think so. That isn't the racy case. We're working with an 
anon_vma_chain, so the anonvma is all there.

The racy case is when we look up an anonvma by the page, and the page gets 
unmapped at the same time because somebody else is travelling over the LRU 
list of the page itself, isn't it?

I do wonder if "page_lock_anon_vma()" should check the whole 
"page_mapped()" case _after_ taking the anon_vma lock. Because if the race 
happens, we're following an anon_vma list that has nothing to do with that 
page (it's still a _valid_ list, since we locked the anon_vma, but will it 
be ok?)

IOW, what is it that really keeps the anon_vma list reliable _and_ 
relevant wrt the page? We know we may get a stale anon_vma, are we ok if 
that anon_vma list doesn't actually have anything to do with the page any 
more?

I think the first check in "page_address_in_vma()" protects us, but 
whatever.
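
To make the "check again _after_ taking the lock" idea concrete, here is
a minimal userspace sketch. It is not the kernel code: every model_*
name is hypothetical, the "lock" is a plain flag, and the
model_race_window hook exists only so that a concurrent unmap can be
simulated between the first check and the lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are kernel types. */
struct model_anon_vma {
	int locked;				/* 1 while "locked" */
};

struct model_page {
	struct model_anon_vma *anon_vma;	/* may be stale */
	int mapcount;				/* > 0: still mapped */
};

/* Optional hook that runs in the window between check and lock. */
void (*model_race_window)(struct model_page *page);

bool model_page_mapped(const struct model_page *page)
{
	return page->mapcount > 0;
}

/* Simulates somebody else unmapping the page. */
void model_unmap(struct model_page *page)
{
	page->mapcount = 0;
}

void model_lock(struct model_anon_vma *av)
{
	assert(!av->locked);
	av->locked = 1;
}

void model_unlock(struct model_anon_vma *av)
{
	assert(av->locked);
	av->locked = 0;
}

/*
 * Returns the locked anon_vma, or NULL if the page is not (or is no
 * longer) mapped. The second model_page_mapped() call is the recheck
 * under the lock; without it a stale anon_vma would be returned.
 */
struct model_anon_vma *model_page_lock_anon_vma(struct model_page *page)
{
	struct model_anon_vma *av;

	if (!model_page_mapped(page))
		return NULL;
	av = page->anon_vma;
	if (!av)
		return NULL;

	if (model_race_window)
		model_race_window(page);	/* simulated concurrent unmap */

	model_lock(av);
	if (!model_page_mapped(page)) {		/* recheck under the lock */
		model_unlock(av);
		return NULL;
	}
	return av;
}
```

With model_race_window left NULL the lookup returns the locked
anon_vma; with it pointed at model_unmap() the recheck notices the
unmap, drops the lock and returns NULL.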

However, that made me look at the PAGE_MIGRATION case. That seems to be 
just broken. It's doing that page_anon_vma() + spin_lock without holding 
any RCU locks, so there is no guarantee that anon_vma there is at all 
valid.

Is that function always called with rcu_read_lock()? 

			Linus


* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 15:40           ` Rik van Riel
@ 2010-04-06 15:58             ` Minchan Kim
  0 siblings, 0 replies; 242+ messages in thread
From: Minchan Kim @ 2010-04-06 15:58 UTC (permalink / raw)
  To: Rik van Riel
  Cc: KOSAKI Motohiro, Linus Torvalds, Borislav Petkov, Andrew Morton,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins

On Tue, 2010-04-06 at 11:40 -0400, Rik van Riel wrote:
> On 04/06/2010 11:34 AM, Minchan Kim wrote:
> 
> > Let's see the unlink_anon_vmas.
> >
> > 1. list_for_each_entry_safe(avc,next, vma->anon_vma_chain, same_vma)
> > 2. 	anon_vma_unlink
> > 3. 		spin_lock(anon_vma->lock)<-- HERE LOCK.
> > 4.		list_del(anon_vma_chain->same_anon_vma);
> >
> > What if anon_vma is destroyed and reuse by SLAB_XXX_RCU for another
> > anon_vma object between 2 and 3?
> > I mean how to make sure 3) does lock valid anon_vma?
> >
> > I hope it is culprit.
> 
> How can the anon_vma get destroyed and reused, when this
> anon_vma_chain still has a reference to it (and the

Doesn't anon_vma_chain have a ref counter on anon_vma?

> anon_vma has not been freed yet)?

AFAIK, an anon_vma can be reused without being freed, because of
SLAB_XXX_RCU. So we always access it carefully, either through
page_lock_anon_vma or through a manual check with RCU and page_mapped.

What am I missing?

> 
> What combination of circumstances is necessary for
> your bug hypothetical to happen?


	CPU A				CPU B

unlink_anon_vmas
list_for_each_entry 
				
					free_pgtable
					anon_vma_unlink
  <crazy stall>				spin_lock(anon_vma);
					list_del(same_anon_vma)
					spin_unlock(anon_vma)
anon_vma_unlink
					anon_vma_free
					reuse for another anon_vma
spin_lock(another anon_vma)
list_del(another anon_vma)

If my assumption is wrong, please correct me. 
Thanks, Rik. 

-- 
Kind regards,
Minchan Kim




* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 15:55           ` Linus Torvalds
@ 2010-04-06 16:23             ` Minchan Kim
  2010-04-06 16:28               ` Linus Torvalds
  2010-04-06 16:32               ` Linus Torvalds
  2010-04-07  8:37             ` Peter Zijlstra
  1 sibling, 2 replies; 242+ messages in thread
From: Minchan Kim @ 2010-04-06 16:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, KOSAKI Motohiro, Borislav Petkov, Andrew Morton,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins

Hi, Linus. 

On Tue, 2010-04-06 at 08:55 -0700, Linus Torvalds wrote:
> 
> On Wed, 7 Apr 2010, Minchan Kim wrote:
> > 
> > Let's see the unlink_anon_vmas. 
> > 
> > 1. list_for_each_entry_safe(avc,next, vma->anon_vma_chain, same_vma)
> > 2. 	anon_vma_unlink
> > 3. 		spin_lock(anon_vma->lock) <-- HERE LOCK.
> > 4.		list_del(anon_vma_chain->same_anon_vma);
> > 
> > What if anon_vma is destroyed and reuse by SLAB_XXX_RCU for another
> > anon_vma object between 2 and 3?
> > I mean how to make sure 3) does lock valid anon_vma? 
> > 
> > I hope it is culprit.
> 
> I don't think so. That isn't the racy case. We're working with a 
> anon_vma_chain, so the anonvma is all there.
> 

But the anon_vma is being used for another anon_vma. 
Nonetheless, anon_vma_unlink does list_del(anon_vma's same_anon_vma).
That is what I am suspicious of. 

> The racy case is when we look up an anonvma by the page, and the page gets 
> unmapped at the same time because somebody else is travelling over the LRU 
> list of the page itself, isn't it?

Yes, but I thought the page might walk anon_vmas whose
same_anon_vma entry was deleted by the race.

> 
> I do wonder if "page_lock_anon_vma()" should check the whole 
> "page_mapped()" case _after_ taking the anon_vma lock. Because if the race 
> happens, we're following a anon_vma list that has nothing to do with that 
> page (it's stilla _valid_ list, since we locked the anon_vma, but will it 
> be ok?)

So we always use it together with vma_address and page_check_address
to validate the anon_vma.
But I don't think that is good design. I would rather take the lock
before checking page_mapped, but maybe that is a performance issue?
I am not sure. 

> 
> IOW, what is it that really keeps the anon_vma list reliable _and_ 
> relevant wrt the page? We know we may get a stale anon_vma, are we ok if 
> that anon_vma list doesn't actually have anything to do with the page any 
> more?
> I think the first check in "page_address_in_vma()" protects us, but 
> whatever.
> 
> However, that made me look at the PAGE_MIGRATION case. That seems to be 
> just broken. It's doing that page_anon_vma() + spin_lock without holding 
> any RCU locks, so there is no guarantee that anon_vma there is at all 
> valid.

FYI, there was a recent patch for the migration case. 
http://lkml.org/lkml/2010/4/2/145


> 
> Is that function always called with rcu_read_lock()? 
> 
> 			Linus


-- 
Kind regards,
Minchan Kim




* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 16:23             ` Minchan Kim
@ 2010-04-06 16:28               ` Linus Torvalds
  2010-04-06 16:45                 ` Minchan Kim
  2010-04-06 16:32               ` Linus Torvalds
  1 sibling, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-06 16:28 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, KOSAKI Motohiro, Borislav Petkov, Andrew Morton,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins



On Wed, 7 Apr 2010, Minchan Kim wrote:
> > 
> > However, that made me look at the PAGE_MIGRATION case. That seems to be 
> > just broken. It's doing that page_anon_vma() + spin_lock without holding 
> > any RCU locks, so there is no guarantee that anon_vma there is at all 
> > valid.
> 
> FYI, recently there is a patch about migration case. 
> http://lkml.org/lkml/2010/4/2/145

No, I'm talking about rmap_walk_anon():

        anon_vma = page_anon_vma(page);
        if (!anon_vma)
                return ret;
        spin_lock(&anon_vma->lock);

which seems to be simply buggy. The anon_vma may not exist any more, 
because an RCU event might have really freed the page between looking it 
up and locking it.

			Linus


* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 16:23             ` Minchan Kim
  2010-04-06 16:28               ` Linus Torvalds
@ 2010-04-06 16:32               ` Linus Torvalds
  2010-04-06 16:54                 ` Minchan Kim
  1 sibling, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-06 16:32 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, KOSAKI Motohiro, Borislav Petkov, Andrew Morton,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins



On Wed, 7 Apr 2010, Minchan Kim wrote:
> > 
> > I don't think so. That isn't the racy case. We're working with a 
> > anon_vma_chain, so the anonvma is all there.
> 
> But the anon_vma is using for another anon_vma. 

No, that can only happen if somebody has done "anon_vma_free()" on it. And 
nobody does that if the anonvma still has a non-empty '&anon_vma->head'.

So as long as the anon_vma has an anon_vma_chain entry associated with it 
(or a ksm refcount, but that's a separate issue), it's not going to be 
re-allocated for any other use, because it's not going to be free'd.
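
As a rough illustration of that lifetime rule, here is a userspace
sketch. It is not the kernel implementation: the model_* names are
hypothetical, the list helpers are a stripped-down stand-in for the
kernel's list_head, the 'freed' flag stands in for anon_vma_free(), and
the ksm refcount case is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly-linked list, modelled on list_head. */
struct model_list {
	struct model_list *next, *prev;
};

struct model_anon_vma {
	struct model_list head;		/* chain entries using this anon_vma */
	bool freed;			/* stands in for anon_vma_free() */
};

struct model_avc {
	struct model_anon_vma *anon_vma;
	struct model_list same_anon_vma;
};

void model_list_init(struct model_list *l)
{
	l->next = l->prev = l;
}

bool model_list_empty(const struct model_list *l)
{
	return l->next == l;
}

void model_list_add(struct model_list *entry, struct model_list *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

void model_list_del(struct model_list *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = entry->prev = entry;
}

void model_anon_vma_init(struct model_anon_vma *av)
{
	model_list_init(&av->head);
	av->freed = false;
}

void model_link(struct model_avc *avc, struct model_anon_vma *av)
{
	avc->anon_vma = av;
	model_list_add(&avc->same_anon_vma, &av->head);
}

/*
 * Unlink one chain entry. The anon_vma is "freed" only when that
 * unlink left its list empty, so an anon_vma that still has a chain
 * entry can never have been handed out for reuse.
 * Returns true if the anon_vma was freed.
 */
bool model_unlink(struct model_avc *avc)
{
	struct model_anon_vma *av = avc->anon_vma;

	model_list_del(&avc->same_anon_vma);
	avc->anon_vma = NULL;
	if (!model_list_empty(&av->head))
		return false;
	av->freed = true;
	return true;
}
```

Linking two chain entries and unlinking them one at a time frees the
anon_vma only on the second unlink, which is the property the argument
above relies on.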
	
		Linus


* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 16:28               ` Linus Torvalds
@ 2010-04-06 16:45                 ` Minchan Kim
  2010-04-06 16:53                   ` Linus Torvalds
  0 siblings, 1 reply; 242+ messages in thread
From: Minchan Kim @ 2010-04-06 16:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, KOSAKI Motohiro, Borislav Petkov, Andrew Morton,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins

On Tue, 2010-04-06 at 09:28 -0700, Linus Torvalds wrote:
> 
> On Wed, 7 Apr 2010, Minchan Kim wrote:
> > > 
> > > However, that made me look at the PAGE_MIGRATION case. That seems to be 
> > > just broken. It's doing that page_anon_vma() + spin_lock without holding 
> > > any RCU locks, so there is no guarantee that anon_vma there is at all 
> > > valid.
> > 
> > FYI, recently there is a patch about migration case. 
> > http://lkml.org/lkml/2010/4/2/145
> 
> No, I'm talking about rmap_walk_anon():
> 
>         anon_vma = page_anon_vma(page);
>         if (!anon_vma)
>                 return ret;
>         spin_lock(&anon_vma->lock);
> 
> which seems to be simply buggy. The anon_vma may not exist any more, 
> because an RCU event might have really freed the page between looking it 
> up and locking it.
> 
> 			Linus

unmap_and_move
	remove_migration_ptes
		rmap_walk
			rmap_walk_anon

We always hold rcu_read_lock for anon pages in unmap_and_move.
So I think it's not buggy. What am I missing?


-- 
Kind regards,
Minchan Kim




* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 16:45                 ` Minchan Kim
@ 2010-04-06 16:53                   ` Linus Torvalds
  2010-04-06 17:04                     ` Rik van Riel
  0 siblings, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-06 16:53 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, KOSAKI Motohiro, Borislav Petkov, Andrew Morton,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins



On Wed, 7 Apr 2010, Minchan Kim wrote:
> 
> unmap_and_move
> 	remove_migration_ptes
> 		rmap_walk
> 			rmap_walk_anon
> 
> We always has rcu_read_lock about anon page in unmap_and_move.
> So I think it's not buggy. What am I missing?

Ok, in that case it's fine.

However, it does bring back my comment about all those anonvma changes: 
the locking is totally undocumented. 

Why isn't there a thing _saying_ that it's ok because of this?

Why is there no comment about the locking of that 'same_vma' / 
'vma->anon_vma_chain' except for the totally nonsensical one about 
page_table_lock (which doesn't protect _any_ of the other cases)?
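
For what it's worth, here is one way such a rule could be written down
and checked, sketched in userspace C. This is an assumption-laden
model, not a patch: the model_* functions are hypothetical, the RCU
read-side section is reduced to a nesting counter, and only the
unmap_and_move -> rmap_walk_anon relationship from this thread is
represented.

```c
#include <assert.h>
#include <stddef.h>

/* Nesting depth of the modelled RCU read-side section. */
int model_rcu_depth;

void model_rcu_read_lock(void)
{
	model_rcu_depth++;
}

void model_rcu_read_unlock(void)
{
	assert(model_rcu_depth > 0);
	model_rcu_depth--;
}

struct model_page {
	void *anon_vma;		/* NULL if the page has no anon_vma */
};

/*
 * Locking: the caller must be inside an RCU read-side section, so the
 * anon_vma looked up from the page cannot go away before it is locked.
 * In the call chain discussed here that section is entered by
 * unmap_and_move().
 *
 * Returns 0 if there was nothing to walk, 1 otherwise.
 */
int model_rmap_walk_anon(struct model_page *page)
{
	assert(model_rcu_depth > 0);	/* the rule, checked as well as stated */

	if (!page->anon_vma)
		return 0;
	/* The real function would lock the anon_vma and walk it here. */
	return 1;
}

/* The caller that provides the read-side section for the walk. */
int model_unmap_and_move(struct model_page *page)
{
	int ret;

	model_rcu_read_lock();
	ret = model_rmap_walk_anon(page);
	model_rcu_read_unlock();
	return ret;
}
```

Calling model_rmap_walk_anon() directly, outside the read-side section,
trips the assertion, so a caller that breaks the rule is caught rather
than silently racing.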

		Linus


* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 16:32               ` Linus Torvalds
@ 2010-04-06 16:54                 ` Minchan Kim
  0 siblings, 0 replies; 242+ messages in thread
From: Minchan Kim @ 2010-04-06 16:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, KOSAKI Motohiro, Borislav Petkov, Andrew Morton,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins

On Tue, 2010-04-06 at 09:32 -0700, Linus Torvalds wrote:
> 
> On Wed, 7 Apr 2010, Minchan Kim wrote:
> > > 
> > > I don't think so. That isn't the racy case. We're working with a 
> > > anon_vma_chain, so the anonvma is all there.
> > 
> > But the anon_vma is using for another anon_vma. 
> 
> No, that can only happen if somebody has done "anon_vma_free()" on it. And 
> nobody does that if the anonvma still has a non-empty'&anon_vma->head'.
> 
> So as long as the anon_vma has a anon_vma_chain entry associated with it 
> (or a ksm refcount, but that's a separate issue), it's not going to be 
> re-allocated for any other use, because it's not going to be free'd.
> 	

> 		Linus

That's what I was missing.
Thanks, Linus. 

I will think over the problem. :)

-- 
Kind regards,
Minchan Kim




* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 16:53                   ` Linus Torvalds
@ 2010-04-06 17:04                     ` Rik van Riel
  2010-04-06 18:28                       ` Linus Torvalds
  0 siblings, 1 reply; 242+ messages in thread
From: Rik van Riel @ 2010-04-06 17:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Minchan Kim, KOSAKI Motohiro, Borislav Petkov, Andrew Morton,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins

On 04/06/2010 12:53 PM, Linus Torvalds wrote:
> On Wed, 7 Apr 2010, Minchan Kim wrote:
>>
>> unmap_and_move
>> 	remove_migration_ptes
>> 		rmap_walk
>> 			rmap_walk_anon
>>
>> We always has rcu_read_lock about anon page in unmap_and_move.
>> So I think it's not buggy. What am I missing?
>
> Ok, in that case it's fine.
>
> However, it does bring back my comment about all those anonvma changes:
> the locking is totally undocumented.
>
> Why isn't there a thing _saying_ that it's ok because of this?
>
> Why is there no comment about the locking of that 'same_vma' /
> 'vma->anon_vma_chain' except for the totally nonsensical one about
> page_table_lock (which doesn't protect _any_ of the other cases)?

Which other cases? When do we ever walk the "same_vma" list
not from the context of the process owning the vma?

This bug in page_referenced is walking the "same_anon_vma" list,
which is locked with the anon_vma->lock.


* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 14:38       ` Rik van Riel
  2010-04-06 15:34         ` Minchan Kim
@ 2010-04-06 17:05         ` Borislav Petkov
  1 sibling, 0 replies; 242+ messages in thread
From: Borislav Petkov @ 2010-04-06 17:05 UTC (permalink / raw)
  To: Rik van Riel
  Cc: KOSAKI Motohiro, Linus Torvalds, Andrew Morton,
	Linux Kernel Mailing List, Lee Schermerhorn, Minchan Kim,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins

[-- Attachment #1: Type: text/plain, Size: 975 bytes --]

From: Rik van Riel <riel@redhat.com>
Date: Tue, Apr 06, 2010 at 10:38:18AM -0400

> This makes me wonder if perhaps the bug is a side effect
> of something Borislav (and the other reproducers) have
> in their kernel configuration, which we do not have.
> 
> Another (unlikely) thing is that the fix for the leak
> makes the bug go away.  Yes, very unlikely.
> 
> Borislav, could you please send us your .config ?

attached.

> Also, if you have the time, could you try out the
> patch (-v2) I mailed in a little up this thread
> that fixes the memory leak in anon_vma_fork?

Sure, building on top of v2.6.34-rc3-288-gab195c5.

Will try to trigger it but let me remind you that it will take a while
since it doesn't happen every time I suspend.

Any other printks or debug output which might be helpful to slap at the
site, page_referenced_anon() I mean?

> I suspect it should not change anything, but it
> could be useful to rule out anyway.
> 

-- 
Regards/Gruss,
    Boris.

[-- Attachment #2: config-2.6.34-rc3 --]
[-- Type: text/plain, Size: 62617 bytes --]

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.34-rc3
# Tue Mar 30 22:52:17 2010
#
CONFIG_64BIT=y
# CONFIG_X86_32 is not set
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_DEFAULT_IDLE=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_HAVE_CPUMASK_OF_CPU_MAP=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ZONE_DMA32=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_HAVE_EARLY_RES=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_HARDIRQS_NO__DO_IRQ=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_USE_GENERIC_SMP_HELPERS=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_TRAMPOLINE=y
# CONFIG_KTIME_SCALAR is not set
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_CONSTRUCTORS=y

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_LZO is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
# CONFIG_AUDIT is not set

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_TREE_PREEMPT_RCU is not set
# CONFIG_TINY_RCU is not set
# CONFIG_RCU_TRACE is not set
CONFIG_RCU_FANOUT=64
# CONFIG_RCU_FANOUT_EXACT is not set
# CONFIG_RCU_FAST_NO_HZ is not set
# CONFIG_TREE_RCU_TRACE is not set
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=21
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
# CONFIG_CGROUPS is not set
# CONFIG_SYSFS_DEPRECATED_V2 is not set
# CONFIG_RELAY is not set
CONFIG_NAMESPACES=y
# CONFIG_UTS_NS is not set
# CONFIG_IPC_NS is not set
# CONFIG_USER_NS is not set
# CONFIG_PID_NS is not set
# CONFIG_NET_NS is not set
# CONFIG_BLK_DEV_INITRD is not set
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
CONFIG_ANON_INODES=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_HAVE_PERF_EVENTS=y

#
# Kernel Performance Events And Counters
#
CONFIG_PERF_EVENTS=y
# CONFIG_PERF_COUNTERS is not set
# CONFIG_DEBUG_PERF_USE_VMALLOC is not set
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_PCI_QUIRKS=y
CONFIG_SLUB_DEBUG=y
# CONFIG_COMPAT_BRK is not set
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
# CONFIG_PROFILING is not set
CONFIG_TRACEPOINTS=y
CONFIG_HAVE_OPROFILE=y
# CONFIG_KPROBES is not set
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_OPTPROBES=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_ATTRS=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_DMA_API_DEBUG=y
CONFIG_HAVE_HW_BREAKPOINT=y
CONFIG_HAVE_USER_RETURN_NOTIFIER=y

#
# GCOV-based kernel profiling
#
# CONFIG_GCOV_KERNEL is not set
CONFIG_SLOW_WORK=y
# CONFIG_SLOW_WORK_DEBUG is not set
# CONFIG_HAVE_GENERIC_DMA_COHERENT is not set
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
CONFIG_MODULE_FORCE_LOAD=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
CONFIG_MODVERSIONS=y
CONFIG_MODULE_SRCVERSION_ALL=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
CONFIG_BLK_DEV_BSG=y
# CONFIG_BLK_DEV_INTEGRITY is not set
CONFIG_BLOCK_COMPAT=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_INLINE_SPIN_TRYLOCK is not set
# CONFIG_INLINE_SPIN_TRYLOCK_BH is not set
# CONFIG_INLINE_SPIN_LOCK is not set
# CONFIG_INLINE_SPIN_LOCK_BH is not set
# CONFIG_INLINE_SPIN_LOCK_IRQ is not set
# CONFIG_INLINE_SPIN_LOCK_IRQSAVE is not set
# CONFIG_INLINE_SPIN_UNLOCK is not set
# CONFIG_INLINE_SPIN_UNLOCK_BH is not set
# CONFIG_INLINE_SPIN_UNLOCK_IRQ is not set
# CONFIG_INLINE_SPIN_UNLOCK_IRQRESTORE is not set
# CONFIG_INLINE_READ_TRYLOCK is not set
# CONFIG_INLINE_READ_LOCK is not set
# CONFIG_INLINE_READ_LOCK_BH is not set
# CONFIG_INLINE_READ_LOCK_IRQ is not set
# CONFIG_INLINE_READ_LOCK_IRQSAVE is not set
# CONFIG_INLINE_READ_UNLOCK is not set
# CONFIG_INLINE_READ_UNLOCK_BH is not set
# CONFIG_INLINE_READ_UNLOCK_IRQ is not set
# CONFIG_INLINE_READ_UNLOCK_IRQRESTORE is not set
# CONFIG_INLINE_WRITE_TRYLOCK is not set
# CONFIG_INLINE_WRITE_LOCK is not set
# CONFIG_INLINE_WRITE_LOCK_BH is not set
# CONFIG_INLINE_WRITE_LOCK_IRQ is not set
# CONFIG_INLINE_WRITE_LOCK_IRQSAVE is not set
# CONFIG_INLINE_WRITE_UNLOCK is not set
# CONFIG_INLINE_WRITE_UNLOCK_BH is not set
# CONFIG_INLINE_WRITE_UNLOCK_IRQ is not set
# CONFIG_INLINE_WRITE_UNLOCK_IRQRESTORE is not set
# CONFIG_MUTEX_SPIN_ON_OWNER is not set
CONFIG_FREEZER=y

#
# Processor type and features
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_SMP=y
# CONFIG_SPARSE_IRQ is not set
# CONFIG_X86_MPPARSE is not set
# CONFIG_X86_EXTENDED_PLATFORM is not set
CONFIG_X86_SUPPORTS_MEMORY_FAILURE=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
# CONFIG_PARAVIRT_GUEST is not set
CONFIG_NO_BOOTMEM=y
CONFIG_MEMTEST=y
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
# CONFIG_MPENTIUMM is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
CONFIG_MK8=y
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MGEODEGX1 is not set
# CONFIG_MGEODE_LX is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_MVIAC7 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_MATOM is not set
# CONFIG_GENERIC_CPU is not set
CONFIG_X86_CPU=y
CONFIG_X86_INTERNODE_CACHE_SHIFT=6
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_XADD=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INTEL_USERCOPY=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_CENTAUR=y
# CONFIG_X86_DS is not set
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
CONFIG_GART_IOMMU=y
# CONFIG_CALGARY_IOMMU is not set
# CONFIG_AMD_IOMMU is not set
CONFIG_SWIOTLB=y
CONFIG_IOMMU_HELPER=y
# CONFIG_IOMMU_API is not set
# CONFIG_MAXSMP is not set
CONFIG_NR_CPUS=8
# CONFIG_SCHED_SMT is not set
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
# CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS is not set
CONFIG_X86_MCE=y
# CONFIG_X86_MCE_INTEL is not set
CONFIG_X86_MCE_AMD=y
CONFIG_X86_MCE_THRESHOLD=y
# CONFIG_X86_MCE_INJECT is not set
# CONFIG_I8K is not set
CONFIG_MICROCODE=m
# CONFIG_MICROCODE_INTEL is not set
CONFIG_MICROCODE_AMD=y
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=m
CONFIG_X86_CPUID=m
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_DIRECT_GBPAGES=y
# CONFIG_NUMA is not set
CONFIG_ARCH_PROC_KCORE_TEXT=y
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000
CONFIG_SELECT_MEMORY_MODEL=y
# CONFIG_FLATMEM_MANUAL is not set
# CONFIG_DISCONTIGMEM_MANUAL is not set
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_HAVE_MEMORY_PRESENT=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y
CONFIG_SPARSEMEM_VMEMMAP=y
# CONFIG_MEMORY_HOTPLUG is not set
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=999999
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_MMU_NOTIFIER=y
# CONFIG_KSM is not set
CONFIG_DEFAULT_MMAP_MIN_ADDR=4096
CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
# CONFIG_MEMORY_FAILURE is not set
# CONFIG_X86_CHECK_BIOS_CORRUPTION is not set
CONFIG_X86_RESERVE_LOW_64K=y
CONFIG_MTRR=y
CONFIG_MTRR_SANITIZER=y
CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT=0
CONFIG_MTRR_SANITIZER_SPARE_REG_NR_DEFAULT=1
CONFIG_X86_PAT=y
CONFIG_ARCH_USES_PG_UNCACHED=y
# CONFIG_EFI is not set
# CONFIG_SECCOMP is not set
# CONFIG_CC_STACKPROTECTOR is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_SCHED_HRTICK=y
# CONFIG_KEXEC is not set
# CONFIG_CRASH_DUMP is not set
CONFIG_PHYSICAL_START=0x1000000
# CONFIG_RELOCATABLE is not set
CONFIG_PHYSICAL_ALIGN=0x1000000
CONFIG_HOTPLUG_CPU=y
# CONFIG_COMPAT_VDSO is not set
# CONFIG_CMDLINE_BOOL is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y

#
# Power management and ACPI options
#
CONFIG_ARCH_HIBERNATION_HEADER=y
CONFIG_PM=y
# CONFIG_PM_DEBUG is not set
CONFIG_PM_SLEEP_SMP=y
CONFIG_PM_SLEEP=y
CONFIG_SUSPEND=y
CONFIG_SUSPEND_FREEZER=y
CONFIG_HIBERNATION_NVS=y
CONFIG_HIBERNATION=y
CONFIG_PM_STD_PARTITION="/dev/sda2"
# CONFIG_PM_RUNTIME is not set
CONFIG_PM_OPS=y
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
# CONFIG_ACPI_PROCFS is not set
# CONFIG_ACPI_PROCFS_POWER is not set
# CONFIG_ACPI_POWER_METER is not set
CONFIG_ACPI_SYSFS_POWER=y
# CONFIG_ACPI_PROC_EVENT is not set
# CONFIG_ACPI_AC is not set
# CONFIG_ACPI_BATTERY is not set
# CONFIG_ACPI_BUTTON is not set
# CONFIG_ACPI_FAN is not set
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_HOTPLUG_CPU=y
# CONFIG_ACPI_PROCESSOR_AGGREGATOR is not set
CONFIG_ACPI_THERMAL=y
# CONFIG_ACPI_CUSTOM_DSDT is not set
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
# CONFIG_ACPI_PCI_SLOT is not set
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=y
# CONFIG_ACPI_SBS is not set
# CONFIG_SFI is not set

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=m
# CONFIG_CPU_FREQ_DEBUG is not set
# CONFIG_CPU_FREQ_STAT is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=m
CONFIG_CPU_FREQ_GOV_USERSPACE=m
CONFIG_CPU_FREQ_GOV_ONDEMAND=m
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m

#
# CPUFreq processor drivers
#
# CONFIG_X86_PCC_CPUFREQ is not set
CONFIG_X86_ACPI_CPUFREQ=m
CONFIG_X86_POWERNOW_K8=m
# CONFIG_X86_SPEEDSTEP_CENTRINO is not set
# CONFIG_X86_P4_CLOCKMOD is not set

#
# shared options
#
# CONFIG_X86_SPEEDSTEP_LIB is not set
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_CPU_IDLE_GOV_MENU=y

#
# Memory power savings
#
# CONFIG_I7300_IDLE is not set

#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCI_DOMAINS=y
CONFIG_PCIEPORTBUS=y
CONFIG_PCIEAER=y
# CONFIG_PCIE_ECRC is not set
# CONFIG_PCIEAER_INJECT is not set
# CONFIG_PCIEASPM is not set
CONFIG_ARCH_SUPPORTS_MSI=y
# CONFIG_PCI_MSI is not set
# CONFIG_PCI_DEBUG is not set
# CONFIG_PCI_STUB is not set
CONFIG_HT_IRQ=y
# CONFIG_PCI_IOV is not set
CONFIG_PCI_IOAPIC=y
CONFIG_ISA_DMA_API=y
CONFIG_K8_NB=y
# CONFIG_PCCARD is not set
# CONFIG_HOTPLUG_PCI is not set

#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
# CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set
# CONFIG_HAVE_AOUT is not set
CONFIG_BINFMT_MISC=m
CONFIG_IA32_EMULATION=y
CONFIG_IA32_AOUT=y
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_NET=y

#
# Networking options
#
CONFIG_PACKET=m
CONFIG_UNIX=y
CONFIG_XFRM=y
CONFIG_XFRM_USER=m
# CONFIG_XFRM_SUB_POLICY is not set
# CONFIG_XFRM_MIGRATE is not set
# CONFIG_XFRM_STATISTICS is not set
CONFIG_XFRM_IPCOMP=m
# CONFIG_NET_KEY is not set
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
# CONFIG_IP_ADVANCED_ROUTER is not set
CONFIG_IP_FIB_HASH=y
# CONFIG_IP_PNP is not set
CONFIG_NET_IPIP=m
CONFIG_NET_IPGRE=m
# CONFIG_NET_IPGRE_BROADCAST is not set
# CONFIG_IP_MROUTE is not set
# CONFIG_ARPD is not set
CONFIG_SYN_COOKIES=y
CONFIG_INET_AH=m
CONFIG_INET_ESP=m
CONFIG_INET_IPCOMP=m
CONFIG_INET_XFRM_TUNNEL=m
CONFIG_INET_TUNNEL=m
CONFIG_INET_XFRM_MODE_TRANSPORT=m
CONFIG_INET_XFRM_MODE_TUNNEL=m
CONFIG_INET_XFRM_MODE_BEET=y
# CONFIG_INET_LRO is not set
CONFIG_INET_DIAG=m
CONFIG_INET_TCP_DIAG=m
# CONFIG_TCP_CONG_ADVANCED is not set
CONFIG_TCP_CONG_CUBIC=y
CONFIG_DEFAULT_TCP_CONG="cubic"
# CONFIG_TCP_MD5SIG is not set
CONFIG_IPV6=m
CONFIG_IPV6_PRIVACY=y
# CONFIG_IPV6_ROUTER_PREF is not set
# CONFIG_IPV6_OPTIMISTIC_DAD is not set
CONFIG_INET6_AH=m
CONFIG_INET6_ESP=m
CONFIG_INET6_IPCOMP=m
# CONFIG_IPV6_MIP6 is not set
CONFIG_INET6_XFRM_TUNNEL=m
CONFIG_INET6_TUNNEL=m
CONFIG_INET6_XFRM_MODE_TRANSPORT=m
CONFIG_INET6_XFRM_MODE_TUNNEL=m
CONFIG_INET6_XFRM_MODE_BEET=m
# CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set
CONFIG_IPV6_SIT=m
# CONFIG_IPV6_SIT_6RD is not set
CONFIG_IPV6_NDISC_NODETYPE=y
CONFIG_IPV6_TUNNEL=m
# CONFIG_IPV6_MULTIPLE_TABLES is not set
# CONFIG_IPV6_MROUTE is not set
CONFIG_NETWORK_SECMARK=y
CONFIG_NETFILTER=y
# CONFIG_NETFILTER_DEBUG is not set
CONFIG_NETFILTER_ADVANCED=y
CONFIG_BRIDGE_NETFILTER=y

#
# Core Netfilter Configuration
#
CONFIG_NETFILTER_NETLINK=m
CONFIG_NETFILTER_NETLINK_QUEUE=m
CONFIG_NETFILTER_NETLINK_LOG=m
CONFIG_NF_CONNTRACK=m
# CONFIG_NF_CT_ACCT is not set
CONFIG_NF_CONNTRACK_MARK=y
# CONFIG_NF_CONNTRACK_SECMARK is not set
# CONFIG_NF_CONNTRACK_EVENTS is not set
CONFIG_NF_CT_PROTO_DCCP=m
CONFIG_NF_CT_PROTO_SCTP=m
# CONFIG_NF_CT_PROTO_UDPLITE is not set
# CONFIG_NF_CONNTRACK_AMANDA is not set
# CONFIG_NF_CONNTRACK_FTP is not set
# CONFIG_NF_CONNTRACK_H323 is not set
# CONFIG_NF_CONNTRACK_IRC is not set
# CONFIG_NF_CONNTRACK_NETBIOS_NS is not set
# CONFIG_NF_CONNTRACK_PPTP is not set
# CONFIG_NF_CONNTRACK_SANE is not set
# CONFIG_NF_CONNTRACK_SIP is not set
# CONFIG_NF_CONNTRACK_TFTP is not set
# CONFIG_NF_CT_NETLINK is not set
# CONFIG_NETFILTER_TPROXY is not set
CONFIG_NETFILTER_XTABLES=m
CONFIG_NETFILTER_XT_TARGET_CLASSIFY=m
# CONFIG_NETFILTER_XT_TARGET_CONNMARK is not set
# CONFIG_NETFILTER_XT_TARGET_CT is not set
CONFIG_NETFILTER_XT_TARGET_DSCP=m
CONFIG_NETFILTER_XT_TARGET_HL=m
CONFIG_NETFILTER_XT_TARGET_MARK=m
CONFIG_NETFILTER_XT_TARGET_NFLOG=m
CONFIG_NETFILTER_XT_TARGET_NFQUEUE=m
# CONFIG_NETFILTER_XT_TARGET_NOTRACK is not set
CONFIG_NETFILTER_XT_TARGET_RATEEST=m
CONFIG_NETFILTER_XT_TARGET_TRACE=m
CONFIG_NETFILTER_XT_TARGET_SECMARK=m
CONFIG_NETFILTER_XT_TARGET_TCPMSS=m
CONFIG_NETFILTER_XT_TARGET_TCPOPTSTRIP=m
CONFIG_NETFILTER_XT_MATCH_CLUSTER=m
CONFIG_NETFILTER_XT_MATCH_COMMENT=m
# CONFIG_NETFILTER_XT_MATCH_CONNBYTES is not set
# CONFIG_NETFILTER_XT_MATCH_CONNLIMIT is not set
# CONFIG_NETFILTER_XT_MATCH_CONNMARK is not set
# CONFIG_NETFILTER_XT_MATCH_CONNTRACK is not set
CONFIG_NETFILTER_XT_MATCH_DCCP=m
CONFIG_NETFILTER_XT_MATCH_DSCP=m
CONFIG_NETFILTER_XT_MATCH_ESP=m
CONFIG_NETFILTER_XT_MATCH_HASHLIMIT=m
# CONFIG_NETFILTER_XT_MATCH_HELPER is not set
CONFIG_NETFILTER_XT_MATCH_HL=m
CONFIG_NETFILTER_XT_MATCH_IPRANGE=m
CONFIG_NETFILTER_XT_MATCH_LENGTH=m
CONFIG_NETFILTER_XT_MATCH_LIMIT=m
CONFIG_NETFILTER_XT_MATCH_MAC=m
CONFIG_NETFILTER_XT_MATCH_MARK=m
CONFIG_NETFILTER_XT_MATCH_MULTIPORT=m
CONFIG_NETFILTER_XT_MATCH_OWNER=m
CONFIG_NETFILTER_XT_MATCH_POLICY=m
# CONFIG_NETFILTER_XT_MATCH_PHYSDEV is not set
CONFIG_NETFILTER_XT_MATCH_PKTTYPE=m
CONFIG_NETFILTER_XT_MATCH_QUOTA=m
CONFIG_NETFILTER_XT_MATCH_RATEEST=m
CONFIG_NETFILTER_XT_MATCH_REALM=m
CONFIG_NETFILTER_XT_MATCH_RECENT=m
# CONFIG_NETFILTER_XT_MATCH_RECENT_PROC_COMPAT is not set
CONFIG_NETFILTER_XT_MATCH_SCTP=m
# CONFIG_NETFILTER_XT_MATCH_STATE is not set
CONFIG_NETFILTER_XT_MATCH_STATISTIC=m
CONFIG_NETFILTER_XT_MATCH_STRING=m
CONFIG_NETFILTER_XT_MATCH_TCPMSS=m
CONFIG_NETFILTER_XT_MATCH_TIME=m
CONFIG_NETFILTER_XT_MATCH_U32=m
CONFIG_NETFILTER_XT_MATCH_OSF=m
# CONFIG_IP_VS is not set

#
# IP: Netfilter Configuration
#
CONFIG_NF_DEFRAG_IPV4=m
CONFIG_NF_CONNTRACK_IPV4=m
CONFIG_NF_CONNTRACK_PROC_COMPAT=y
CONFIG_IP_NF_QUEUE=m
CONFIG_IP_NF_IPTABLES=m
CONFIG_IP_NF_MATCH_ADDRTYPE=m
CONFIG_IP_NF_MATCH_AH=m
CONFIG_IP_NF_MATCH_ECN=m
CONFIG_IP_NF_MATCH_TTL=m
CONFIG_IP_NF_FILTER=m
CONFIG_IP_NF_TARGET_REJECT=m
CONFIG_IP_NF_TARGET_LOG=m
CONFIG_IP_NF_TARGET_ULOG=m
CONFIG_NF_NAT=m
CONFIG_NF_NAT_NEEDED=y
CONFIG_IP_NF_TARGET_MASQUERADE=m
CONFIG_IP_NF_TARGET_NETMAP=m
CONFIG_IP_NF_TARGET_REDIRECT=m
CONFIG_NF_NAT_SNMP_BASIC=m
CONFIG_NF_NAT_PROTO_DCCP=m
CONFIG_NF_NAT_PROTO_SCTP=m
# CONFIG_NF_NAT_FTP is not set
# CONFIG_NF_NAT_IRC is not set
# CONFIG_NF_NAT_TFTP is not set
# CONFIG_NF_NAT_AMANDA is not set
# CONFIG_NF_NAT_PPTP is not set
# CONFIG_NF_NAT_H323 is not set
# CONFIG_NF_NAT_SIP is not set
CONFIG_IP_NF_MANGLE=m
# CONFIG_IP_NF_TARGET_CLUSTERIP is not set
CONFIG_IP_NF_TARGET_ECN=m
CONFIG_IP_NF_TARGET_TTL=m
CONFIG_IP_NF_RAW=m
CONFIG_IP_NF_ARPTABLES=m
CONFIG_IP_NF_ARPFILTER=m
CONFIG_IP_NF_ARP_MANGLE=m

#
# IPv6: Netfilter Configuration
#
CONFIG_NF_CONNTRACK_IPV6=m
CONFIG_IP6_NF_QUEUE=m
CONFIG_IP6_NF_IPTABLES=m
CONFIG_IP6_NF_MATCH_AH=m
CONFIG_IP6_NF_MATCH_EUI64=m
CONFIG_IP6_NF_MATCH_FRAG=m
CONFIG_IP6_NF_MATCH_OPTS=m
CONFIG_IP6_NF_MATCH_HL=m
CONFIG_IP6_NF_MATCH_IPV6HEADER=m
CONFIG_IP6_NF_MATCH_MH=m
CONFIG_IP6_NF_MATCH_RT=m
CONFIG_IP6_NF_TARGET_HL=m
CONFIG_IP6_NF_TARGET_LOG=m
CONFIG_IP6_NF_FILTER=m
CONFIG_IP6_NF_TARGET_REJECT=m
CONFIG_IP6_NF_MANGLE=m
CONFIG_IP6_NF_RAW=m
# CONFIG_BRIDGE_NF_EBTABLES is not set
CONFIG_IP_DCCP=m
CONFIG_INET_DCCP_DIAG=m

#
# DCCP CCIDs Configuration (EXPERIMENTAL)
#
# CONFIG_IP_DCCP_CCID2_DEBUG is not set
CONFIG_IP_DCCP_CCID3=y
# CONFIG_IP_DCCP_CCID3_DEBUG is not set
CONFIG_IP_DCCP_CCID3_RTO=100
CONFIG_IP_DCCP_TFRC_LIB=y

#
# DCCP Kernel Hacking
#
# CONFIG_IP_DCCP_DEBUG is not set
CONFIG_IP_SCTP=m
# CONFIG_SCTP_DBG_MSG is not set
# CONFIG_SCTP_DBG_OBJCNT is not set
# CONFIG_SCTP_HMAC_NONE is not set
# CONFIG_SCTP_HMAC_SHA1 is not set
CONFIG_SCTP_HMAC_MD5=y
# CONFIG_RDS is not set
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
CONFIG_STP=m
CONFIG_BRIDGE=m
CONFIG_BRIDGE_IGMP_SNOOPING=y
# CONFIG_NET_DSA is not set
# CONFIG_VLAN_8021Q is not set
# CONFIG_DECNET is not set
CONFIG_LLC=m
# CONFIG_LLC2 is not set
CONFIG_IPX=m
# CONFIG_IPX_INTERN is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_ECONET is not set
# CONFIG_WAN_ROUTER is not set
# CONFIG_PHONET is not set
# CONFIG_IEEE802154 is not set
CONFIG_NET_SCHED=y

#
# Queueing/Scheduling
#
CONFIG_NET_SCH_CBQ=m
CONFIG_NET_SCH_HTB=m
CONFIG_NET_SCH_HFSC=m
CONFIG_NET_SCH_PRIO=m
# CONFIG_NET_SCH_MULTIQ is not set
CONFIG_NET_SCH_RED=m
CONFIG_NET_SCH_SFQ=m
CONFIG_NET_SCH_TEQL=m
CONFIG_NET_SCH_TBF=m
CONFIG_NET_SCH_GRED=m
CONFIG_NET_SCH_DSMARK=m
CONFIG_NET_SCH_NETEM=m
# CONFIG_NET_SCH_DRR is not set
CONFIG_NET_SCH_INGRESS=m

#
# Classification
#
CONFIG_NET_CLS=y
CONFIG_NET_CLS_BASIC=m
CONFIG_NET_CLS_TCINDEX=m
CONFIG_NET_CLS_ROUTE4=m
CONFIG_NET_CLS_ROUTE=y
CONFIG_NET_CLS_FW=m
CONFIG_NET_CLS_U32=m
CONFIG_CLS_U32_PERF=y
CONFIG_CLS_U32_MARK=y
CONFIG_NET_CLS_RSVP=m
CONFIG_NET_CLS_RSVP6=m
# CONFIG_NET_CLS_FLOW is not set
CONFIG_NET_EMATCH=y
CONFIG_NET_EMATCH_STACK=32
CONFIG_NET_EMATCH_CMP=m
CONFIG_NET_EMATCH_NBYTE=m
CONFIG_NET_EMATCH_U32=m
CONFIG_NET_EMATCH_META=m
CONFIG_NET_EMATCH_TEXT=m
CONFIG_NET_CLS_ACT=y
CONFIG_NET_ACT_POLICE=m
CONFIG_NET_ACT_GACT=m
CONFIG_GACT_PROB=y
CONFIG_NET_ACT_MIRRED=m
CONFIG_NET_ACT_IPT=m
# CONFIG_NET_ACT_NAT is not set
CONFIG_NET_ACT_PEDIT=m
CONFIG_NET_ACT_SIMP=m
# CONFIG_NET_ACT_SKBEDIT is not set
CONFIG_NET_CLS_IND=y
CONFIG_NET_SCH_FIFO=y
# CONFIG_DCB is not set

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_NET_DROP_MONITOR is not set
# CONFIG_HAMRADIO is not set
# CONFIG_CAN is not set
# CONFIG_IRDA is not set
# CONFIG_BT is not set
CONFIG_AF_RXRPC=m
# CONFIG_AF_RXRPC_DEBUG is not set
# CONFIG_RXKAD is not set
# CONFIG_WIRELESS is not set
# CONFIG_WIMAX is not set
CONFIG_RFKILL=m
CONFIG_RFKILL_INPUT=y
# CONFIG_NET_9P is not set

#
# Device Drivers
#

#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
# CONFIG_DEVTMPFS is not set
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
CONFIG_FIRMWARE_IN_KERNEL=y
CONFIG_EXTRA_FIRMWARE=""
# CONFIG_DEBUG_DRIVER is not set
# CONFIG_DEBUG_DEVRES is not set
# CONFIG_SYS_HYPERVISOR is not set
# CONFIG_CONNECTOR is not set
# CONFIG_MTD is not set
# CONFIG_PARPORT is not set
CONFIG_PNP=y
# CONFIG_PNP_DEBUG_MESSAGES is not set

#
# Protocols
#
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
# CONFIG_BLK_DEV_FD is not set
# CONFIG_BLK_CPQ_DA is not set
# CONFIG_BLK_CPQ_CISS_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=y
CONFIG_BLK_DEV_CRYPTOLOOP=y

#
# DRBD disabled because PROC_FS, INET or CONNECTOR not selected
#
# CONFIG_BLK_DEV_NBD is not set
# CONFIG_BLK_DEV_SX8 is not set
# CONFIG_BLK_DEV_UB is not set
# CONFIG_BLK_DEV_RAM is not set
CONFIG_CDROM_PKTCDVD=m
CONFIG_CDROM_PKTCDVD_BUFFERS=8
# CONFIG_CDROM_PKTCDVD_WCACHE is not set
# CONFIG_ATA_OVER_ETH is not set
# CONFIG_BLK_DEV_HD is not set
CONFIG_MISC_DEVICES=y
# CONFIG_AD525X_DPOT is not set
# CONFIG_IBM_ASM is not set
# CONFIG_PHANTOM is not set
# CONFIG_SGI_IOC4 is not set
# CONFIG_TIFM_CORE is not set
# CONFIG_ICS932S401 is not set
# CONFIG_ENCLOSURE_SERVICES is not set
# CONFIG_CS5535_MFGPT is not set
# CONFIG_HP_ILO is not set
# CONFIG_ISL29003 is not set
# CONFIG_SENSORS_TSL2550 is not set
# CONFIG_DS1682 is not set
# CONFIG_C2PORT is not set

#
# EEPROM support
#
# CONFIG_EEPROM_AT24 is not set
# CONFIG_EEPROM_LEGACY is not set
# CONFIG_EEPROM_MAX6875 is not set
# CONFIG_EEPROM_93CX6 is not set
# CONFIG_CB710_CORE is not set
CONFIG_HAVE_IDE=y
CONFIG_IDE=y

#
# Please see Documentation/ide/ide.txt for help/info on IDE drives
#
CONFIG_IDE_XFER_MODE=y
CONFIG_IDE_ATAPI=y
# CONFIG_BLK_DEV_IDE_SATA is not set
CONFIG_IDE_GD=y
CONFIG_IDE_GD_ATA=y
CONFIG_IDE_GD_ATAPI=y
CONFIG_BLK_DEV_IDECD=y
CONFIG_BLK_DEV_IDECD_VERBOSE_ERRORS=y
CONFIG_BLK_DEV_IDETAPE=y
CONFIG_BLK_DEV_IDEACPI=y
CONFIG_IDE_TASK_IOCTL=y
# CONFIG_IDE_PROC_FS is not set

#
# IDE chipset support/bugfixes
#
CONFIG_IDE_GENERIC=m
# CONFIG_BLK_DEV_PLATFORM is not set
# CONFIG_BLK_DEV_CMD640 is not set
# CONFIG_BLK_DEV_IDEPNP is not set
CONFIG_BLK_DEV_IDEDMA_SFF=y

#
# PCI IDE chipsets support
#
CONFIG_BLK_DEV_IDEPCI=y
CONFIG_IDEPCI_PCIBUS_ORDER=y
# CONFIG_BLK_DEV_OFFBOARD is not set
CONFIG_BLK_DEV_GENERIC=m
# CONFIG_BLK_DEV_OPTI621 is not set
# CONFIG_BLK_DEV_RZ1000 is not set
CONFIG_BLK_DEV_IDEDMA_PCI=y
# CONFIG_BLK_DEV_AEC62XX is not set
# CONFIG_BLK_DEV_ALI15X3 is not set
# CONFIG_BLK_DEV_AMD74XX is not set
CONFIG_BLK_DEV_ATIIXP=y
# CONFIG_BLK_DEV_CMD64X is not set
# CONFIG_BLK_DEV_TRIFLEX is not set
# CONFIG_BLK_DEV_CS5520 is not set
# CONFIG_BLK_DEV_CS5530 is not set
# CONFIG_BLK_DEV_HPT366 is not set
# CONFIG_BLK_DEV_JMICRON is not set
# CONFIG_BLK_DEV_SC1200 is not set
# CONFIG_BLK_DEV_PIIX is not set
# CONFIG_BLK_DEV_IT8172 is not set
# CONFIG_BLK_DEV_IT8213 is not set
# CONFIG_BLK_DEV_IT821X is not set
# CONFIG_BLK_DEV_NS87415 is not set
# CONFIG_BLK_DEV_PDC202XX_OLD is not set
# CONFIG_BLK_DEV_PDC202XX_NEW is not set
# CONFIG_BLK_DEV_SVWKS is not set
# CONFIG_BLK_DEV_SIIMAGE is not set
# CONFIG_BLK_DEV_SIS5513 is not set
# CONFIG_BLK_DEV_SLC90E66 is not set
# CONFIG_BLK_DEV_TRM290 is not set
# CONFIG_BLK_DEV_VIA82CXXX is not set
# CONFIG_BLK_DEV_TC86C001 is not set
CONFIG_BLK_DEV_IDEDMA=y

#
# SCSI device support
#
CONFIG_SCSI_MOD=y
# CONFIG_RAID_ATTRS is not set
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
# CONFIG_SCSI_TGT is not set
# CONFIG_SCSI_NETLINK is not set
# CONFIG_SCSI_PROC_FS is not set

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
# CONFIG_BLK_DEV_SR is not set
# CONFIG_CHR_DEV_SG is not set
# CONFIG_CHR_DEV_SCH is not set
# CONFIG_SCSI_MULTI_LUN is not set
# CONFIG_SCSI_CONSTANTS is not set
# CONFIG_SCSI_LOGGING is not set
# CONFIG_SCSI_SCAN_ASYNC is not set
CONFIG_SCSI_WAIT_SCAN=m

#
# SCSI Transports
#
# CONFIG_SCSI_SPI_ATTRS is not set
# CONFIG_SCSI_FC_ATTRS is not set
# CONFIG_SCSI_ISCSI_ATTRS is not set
# CONFIG_SCSI_SAS_ATTRS is not set
# CONFIG_SCSI_SAS_LIBSAS is not set
# CONFIG_SCSI_SRP_ATTRS is not set
# CONFIG_SCSI_LOWLEVEL is not set
# CONFIG_SCSI_DH is not set
# CONFIG_SCSI_OSD_INITIATOR is not set
CONFIG_ATA=y
# CONFIG_ATA_NONSTANDARD is not set
CONFIG_ATA_VERBOSE_ERROR=y
CONFIG_ATA_ACPI=y
CONFIG_SATA_PMP=y
CONFIG_SATA_AHCI=y
# CONFIG_SATA_SIL24 is not set
# CONFIG_ATA_SFF is not set
CONFIG_MD=y
# CONFIG_BLK_DEV_MD is not set
CONFIG_BLK_DEV_DM=m
# CONFIG_DM_DEBUG is not set
CONFIG_DM_CRYPT=m
# CONFIG_DM_SNAPSHOT is not set
# CONFIG_DM_MIRROR is not set
# CONFIG_DM_ZERO is not set
# CONFIG_DM_MULTIPATH is not set
# CONFIG_DM_DELAY is not set
# CONFIG_DM_UEVENT is not set
# CONFIG_FUSION is not set

#
# IEEE 1394 (FireWire) support
#

#
# You can enable one or both FireWire driver stacks.
#

#
# The newer stack is recommended.
#
# CONFIG_FIREWIRE is not set
CONFIG_IEEE1394=m
CONFIG_IEEE1394_OHCI1394=m
CONFIG_IEEE1394_PCILYNX=m
CONFIG_IEEE1394_SBP2=m
# CONFIG_IEEE1394_SBP2_PHYS_DMA is not set
CONFIG_IEEE1394_ETH1394_ROM_ENTRY=y
CONFIG_IEEE1394_ETH1394=m
CONFIG_IEEE1394_RAWIO=m
CONFIG_IEEE1394_VIDEO1394=m
CONFIG_IEEE1394_DV1394=m
# CONFIG_IEEE1394_VERBOSEDEBUG is not set
# CONFIG_I2O is not set
# CONFIG_MACINTOSH_DRIVERS is not set
CONFIG_NETDEVICES=y
# CONFIG_IFB is not set
# CONFIG_DUMMY is not set
# CONFIG_BONDING is not set
# CONFIG_MACVLAN is not set
# CONFIG_EQUALIZER is not set
CONFIG_TUN=m
# CONFIG_VETH is not set
# CONFIG_NET_SB1000 is not set
# CONFIG_ARCNET is not set
CONFIG_PHYLIB=m

#
# MII PHY device drivers
#
CONFIG_MARVELL_PHY=m
CONFIG_DAVICOM_PHY=m
CONFIG_QSEMI_PHY=m
CONFIG_LXT_PHY=m
CONFIG_CICADA_PHY=m
CONFIG_VITESSE_PHY=m
CONFIG_SMSC_PHY=m
# CONFIG_BROADCOM_PHY is not set
# CONFIG_ICPLUS_PHY is not set
# CONFIG_REALTEK_PHY is not set
# CONFIG_NATIONAL_PHY is not set
# CONFIG_STE10XP is not set
# CONFIG_LSI_ET1011C_PHY is not set
# CONFIG_MDIO_BITBANG is not set
CONFIG_NET_ETHERNET=y
CONFIG_MII=y
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_CASSINI is not set
# CONFIG_NET_VENDOR_3COM is not set
# CONFIG_ETHOC is not set
# CONFIG_DNET is not set
# CONFIG_NET_TULIP is not set
# CONFIG_HP100 is not set
# CONFIG_IBM_NEW_EMAC_ZMII is not set
# CONFIG_IBM_NEW_EMAC_RGMII is not set
# CONFIG_IBM_NEW_EMAC_TAH is not set
# CONFIG_IBM_NEW_EMAC_EMAC4 is not set
# CONFIG_IBM_NEW_EMAC_NO_FLOW_CTRL is not set
# CONFIG_IBM_NEW_EMAC_MAL_CLR_ICINTSTAT is not set
# CONFIG_IBM_NEW_EMAC_MAL_COMMON_ERR is not set
CONFIG_NET_PCI=y
# CONFIG_PCNET32 is not set
# CONFIG_AMD8111_ETH is not set
# CONFIG_ADAPTEC_STARFIRE is not set
# CONFIG_KSZ884X_PCI is not set
# CONFIG_B44 is not set
# CONFIG_FORCEDETH is not set
# CONFIG_E100 is not set
# CONFIG_FEALNX is not set
# CONFIG_NATSEMI is not set
# CONFIG_NE2K_PCI is not set
# CONFIG_8139CP is not set
CONFIG_8139TOO=y
# CONFIG_8139TOO_PIO is not set
# CONFIG_8139TOO_TUNE_TWISTER is not set
# CONFIG_8139TOO_8129 is not set
# CONFIG_8139_OLD_RX_RESET is not set
# CONFIG_R6040 is not set
# CONFIG_SIS900 is not set
# CONFIG_EPIC100 is not set
# CONFIG_SMSC9420 is not set
# CONFIG_SUNDANCE is not set
# CONFIG_TLAN is not set
# CONFIG_KS8842 is not set
# CONFIG_KS8851_MLL is not set
# CONFIG_VIA_RHINE is not set
# CONFIG_SC92031 is not set
# CONFIG_ATL2 is not set
# CONFIG_NETDEV_1000 is not set
# CONFIG_NETDEV_10000 is not set
# CONFIG_TR is not set
# CONFIG_WLAN is not set

#
# Enable WiMAX (Networking options) to see the WiMAX drivers
#

#
# USB Network Adapters
#
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
# CONFIG_USB_USBNET is not set
# CONFIG_USB_HSO is not set
# CONFIG_WAN is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
CONFIG_PPP=m
CONFIG_PPP_MULTILINK=y
CONFIG_PPP_FILTER=y
CONFIG_PPP_ASYNC=m
CONFIG_PPP_SYNC_TTY=m
CONFIG_PPP_DEFLATE=m
CONFIG_PPP_BSDCOMP=m
CONFIG_PPP_MPPE=m
CONFIG_PPPOE=m
# CONFIG_PPPOL2TP is not set
CONFIG_SLIP=m
# CONFIG_SLIP_COMPRESSED is not set
CONFIG_SLHC=m
# CONFIG_SLIP_SMART is not set
# CONFIG_SLIP_MODE_SLIP6 is not set
# CONFIG_NET_FC is not set
CONFIG_NETCONSOLE=y
CONFIG_NETCONSOLE_DYNAMIC=y
CONFIG_NETPOLL=y
CONFIG_NETPOLL_TRAP=y
CONFIG_NET_POLL_CONTROLLER=y
# CONFIG_VMXNET3 is not set
# CONFIG_ISDN is not set
# CONFIG_PHONE is not set

#
# Input device support
#
CONFIG_INPUT=y
# CONFIG_INPUT_FF_MEMLESS is not set
CONFIG_INPUT_POLLDEV=m
# CONFIG_INPUT_SPARSEKMAP is not set

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
CONFIG_INPUT_MOUSEDEV_PSAUX=y
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
# CONFIG_INPUT_JOYDEV is not set
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
# CONFIG_KEYBOARD_ADP5588 is not set
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_QT2160 is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_MAX7359 is not set
# CONFIG_KEYBOARD_NEWTON is not set
# CONFIG_KEYBOARD_OPENCORES is not set
# CONFIG_KEYBOARD_STOWAWAY is not set
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_XTKBD is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=m
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
# CONFIG_MOUSE_PS2_ELANTECH is not set
# CONFIG_MOUSE_PS2_SENTELIC is not set
# CONFIG_MOUSE_PS2_TOUCHKIT is not set
# CONFIG_MOUSE_SERIAL is not set
# CONFIG_MOUSE_APPLETOUCH is not set
# CONFIG_MOUSE_BCM5974 is not set
# CONFIG_MOUSE_VSXXXAA is not set
# CONFIG_MOUSE_SYNAPTICS_I2C is not set
# CONFIG_INPUT_JOYSTICK is not set
# CONFIG_INPUT_TABLET is not set
# CONFIG_INPUT_TOUCHSCREEN is not set
CONFIG_INPUT_MISC=y
CONFIG_INPUT_PCSPKR=m
# CONFIG_INPUT_ATLAS_BTNS is not set
# CONFIG_INPUT_ATI_REMOTE is not set
# CONFIG_INPUT_ATI_REMOTE2 is not set
# CONFIG_INPUT_KEYSPAN_REMOTE is not set
# CONFIG_INPUT_POWERMATE is not set
# CONFIG_INPUT_YEALINK is not set
# CONFIG_INPUT_CM109 is not set
# CONFIG_INPUT_UINPUT is not set
# CONFIG_INPUT_WINBOND_CIR is not set

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
# CONFIG_SERIO_SERPORT is not set
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
# CONFIG_SERIO_RAW is not set
# CONFIG_SERIO_ALTERA_PS2 is not set
# CONFIG_GAMEPORT is not set

#
# Character devices
#
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
# CONFIG_VT_HW_CONSOLE_BINDING is not set
CONFIG_DEVKMEM=y
# CONFIG_SERIAL_NONSTANDARD is not set
# CONFIG_NOZOMI is not set

#
# Serial drivers
#
CONFIG_SERIAL_8250=m
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_SERIAL_8250_PCI=m
CONFIG_SERIAL_8250_PNP=m
CONFIG_SERIAL_8250_NR_UARTS=16
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
# CONFIG_SERIAL_8250_DETECT_IRQ is not set
# CONFIG_SERIAL_8250_RSA is not set

#
# Non-8250 serial port support
#
CONFIG_SERIAL_CORE=m
# CONFIG_SERIAL_JSM is not set
# CONFIG_SERIAL_TIMBERDALE is not set
CONFIG_UNIX98_PTYS=y
# CONFIG_DEVPTS_MULTIPLE_INSTANCES is not set
# CONFIG_LEGACY_PTYS is not set
# CONFIG_IPMI_HANDLER is not set
# CONFIG_HW_RANDOM is not set
# CONFIG_NVRAM is not set
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set
# CONFIG_MWAVE is not set
# CONFIG_PC8736x_GPIO is not set
# CONFIG_RAW_DRIVER is not set
CONFIG_HPET=y
CONFIG_HPET_MMAP=y
# CONFIG_HANGCHECK_TIMER is not set
# CONFIG_TCG_TPM is not set
# CONFIG_TELCLOCK is not set
CONFIG_DEVPORT=y
CONFIG_I2C=y
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_COMPAT=y
CONFIG_I2C_CHARDEV=m
CONFIG_I2C_HELPER_AUTO=y
CONFIG_I2C_ALGOBIT=y

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
# CONFIG_I2C_ALI1535 is not set
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
CONFIG_I2C_AMD756=m
# CONFIG_I2C_AMD756_S4882 is not set
CONFIG_I2C_AMD8111=m
# CONFIG_I2C_I801 is not set
# CONFIG_I2C_ISCH is not set
# CONFIG_I2C_PIIX4 is not set
# CONFIG_I2C_NFORCE2 is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
# CONFIG_I2C_SIS96X is not set
# CONFIG_I2C_VIA is not set
# CONFIG_I2C_VIAPRO is not set

#
# ACPI drivers
#
# CONFIG_I2C_SCMI is not set

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
# CONFIG_I2C_OCORES is not set
# CONFIG_I2C_SIMTEC is not set
# CONFIG_I2C_XILINX is not set

#
# External I2C/SMBus adapter drivers
#
# CONFIG_I2C_PARPORT_LIGHT is not set
# CONFIG_I2C_TAOS_EVM is not set
# CONFIG_I2C_TINY_USB is not set

#
# Other I2C/SMBus bus drivers
#
# CONFIG_I2C_PCA_PLATFORM is not set
# CONFIG_I2C_STUB is not set
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# CONFIG_SPI is not set

#
# PPS support
#
# CONFIG_PPS is not set
CONFIG_ARCH_WANT_OPTIONAL_GPIOLIB=y
# CONFIG_GPIOLIB is not set
# CONFIG_W1 is not set
CONFIG_POWER_SUPPLY=y
# CONFIG_POWER_SUPPLY_DEBUG is not set
# CONFIG_PDA_POWER is not set
# CONFIG_BATTERY_DS2760 is not set
# CONFIG_BATTERY_DS2782 is not set
# CONFIG_BATTERY_BQ27x00 is not set
# CONFIG_BATTERY_MAX17040 is not set
CONFIG_HWMON=y
CONFIG_HWMON_VID=m
# CONFIG_HWMON_DEBUG_CHIP is not set

#
# Native drivers
#
# CONFIG_SENSORS_ABITUGURU is not set
# CONFIG_SENSORS_ABITUGURU3 is not set
# CONFIG_SENSORS_AD7414 is not set
# CONFIG_SENSORS_AD7418 is not set
# CONFIG_SENSORS_ADM1021 is not set
# CONFIG_SENSORS_ADM1025 is not set
# CONFIG_SENSORS_ADM1026 is not set
# CONFIG_SENSORS_ADM1029 is not set
# CONFIG_SENSORS_ADM1031 is not set
# CONFIG_SENSORS_ADM9240 is not set
# CONFIG_SENSORS_ADT7411 is not set
# CONFIG_SENSORS_ADT7462 is not set
# CONFIG_SENSORS_ADT7470 is not set
# CONFIG_SENSORS_ADT7475 is not set
# CONFIG_SENSORS_ASC7621 is not set
CONFIG_SENSORS_K8TEMP=m
CONFIG_SENSORS_K10TEMP=m
CONFIG_SENSORS_ASB100=m
# CONFIG_SENSORS_ATXP1 is not set
# CONFIG_SENSORS_DS1621 is not set
# CONFIG_SENSORS_I5K_AMB is not set
# CONFIG_SENSORS_F71805F is not set
# CONFIG_SENSORS_F71882FG is not set
# CONFIG_SENSORS_F75375S is not set
# CONFIG_SENSORS_FSCHMD is not set
# CONFIG_SENSORS_G760A is not set
# CONFIG_SENSORS_GL518SM is not set
# CONFIG_SENSORS_GL520SM is not set
# CONFIG_SENSORS_CORETEMP is not set
# CONFIG_SENSORS_IT87 is not set
# CONFIG_SENSORS_LM63 is not set
# CONFIG_SENSORS_LM73 is not set
# CONFIG_SENSORS_LM75 is not set
# CONFIG_SENSORS_LM77 is not set
# CONFIG_SENSORS_LM78 is not set
# CONFIG_SENSORS_LM80 is not set
# CONFIG_SENSORS_LM83 is not set
# CONFIG_SENSORS_LM85 is not set
# CONFIG_SENSORS_LM87 is not set
# CONFIG_SENSORS_LM90 is not set
# CONFIG_SENSORS_LM92 is not set
# CONFIG_SENSORS_LM93 is not set
# CONFIG_SENSORS_LTC4215 is not set
# CONFIG_SENSORS_LTC4245 is not set
# CONFIG_SENSORS_LM95241 is not set
# CONFIG_SENSORS_MAX1619 is not set
# CONFIG_SENSORS_MAX6650 is not set
# CONFIG_SENSORS_PC87360 is not set
# CONFIG_SENSORS_PC87427 is not set
# CONFIG_SENSORS_PCF8591 is not set
# CONFIG_SENSORS_SIS5595 is not set
# CONFIG_SENSORS_DME1737 is not set
# CONFIG_SENSORS_SMSC47M1 is not set
# CONFIG_SENSORS_SMSC47M192 is not set
# CONFIG_SENSORS_SMSC47B397 is not set
# CONFIG_SENSORS_ADS7828 is not set
# CONFIG_SENSORS_AMC6821 is not set
# CONFIG_SENSORS_THMC50 is not set
# CONFIG_SENSORS_TMP401 is not set
# CONFIG_SENSORS_TMP421 is not set
# CONFIG_SENSORS_VIA_CPUTEMP is not set
# CONFIG_SENSORS_VIA686A is not set
# CONFIG_SENSORS_VT1211 is not set
# CONFIG_SENSORS_VT8231 is not set
# CONFIG_SENSORS_W83781D is not set
# CONFIG_SENSORS_W83791D is not set
# CONFIG_SENSORS_W83792D is not set
# CONFIG_SENSORS_W83793 is not set
# CONFIG_SENSORS_W83L785TS is not set
# CONFIG_SENSORS_W83L786NG is not set
# CONFIG_SENSORS_W83627HF is not set
# CONFIG_SENSORS_W83627EHF is not set
# CONFIG_SENSORS_HDAPS is not set
# CONFIG_SENSORS_LIS3_I2C is not set
# CONFIG_SENSORS_APPLESMC is not set

#
# ACPI drivers
#
# CONFIG_SENSORS_ATK0110 is not set
# CONFIG_SENSORS_LIS3LV02D is not set
CONFIG_THERMAL=y
# CONFIG_THERMAL_HWMON is not set
# CONFIG_WATCHDOG is not set
CONFIG_SSB_POSSIBLE=y

#
# Sonics Silicon Backplane
#
# CONFIG_SSB is not set

#
# Multifunction device drivers
#
# CONFIG_MFD_CORE is not set
# CONFIG_MFD_88PM860X is not set
# CONFIG_MFD_SM501 is not set
# CONFIG_HTC_PASIC3 is not set
# CONFIG_TWL4030_CORE is not set
# CONFIG_MFD_TMIO is not set
# CONFIG_PMIC_DA903X is not set
# CONFIG_PMIC_ADP5520 is not set
# CONFIG_MFD_MAX8925 is not set
# CONFIG_MFD_WM8400 is not set
# CONFIG_MFD_WM831X is not set
# CONFIG_MFD_WM8350_I2C is not set
# CONFIG_MFD_WM8994 is not set
# CONFIG_MFD_PCF50633 is not set
# CONFIG_AB3100_CORE is not set
# CONFIG_LPC_SCH is not set
# CONFIG_REGULATOR is not set
# CONFIG_MEDIA_SUPPORT is not set

#
# Graphics support
#
CONFIG_AGP=y
CONFIG_AGP_AMD64=y
# CONFIG_AGP_INTEL is not set
# CONFIG_AGP_SIS is not set
# CONFIG_AGP_VIA is not set
CONFIG_VGA_ARB=y
CONFIG_VGA_ARB_MAX_GPUS=16
# CONFIG_VGA_SWITCHEROO is not set
CONFIG_DRM=y
CONFIG_DRM_KMS_HELPER=y
CONFIG_DRM_TTM=y
# CONFIG_DRM_TDFX is not set
# CONFIG_DRM_R128 is not set
CONFIG_DRM_RADEON=y
# CONFIG_DRM_RADEON_KMS is not set
# CONFIG_DRM_MGA is not set
# CONFIG_DRM_SIS is not set
# CONFIG_DRM_VIA is not set
# CONFIG_DRM_SAVAGE is not set
# CONFIG_VGASTATE is not set
# CONFIG_VIDEO_OUTPUT_CONTROL is not set
CONFIG_FB=y
# CONFIG_FIRMWARE_EDID is not set
# CONFIG_FB_DDC is not set
# CONFIG_FB_BOOT_VESA_SUPPORT is not set
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
# CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set
# CONFIG_FB_SYS_FILLRECT is not set
# CONFIG_FB_SYS_COPYAREA is not set
# CONFIG_FB_SYS_IMAGEBLIT is not set
# CONFIG_FB_FOREIGN_ENDIAN is not set
# CONFIG_FB_SYS_FOPS is not set
# CONFIG_FB_SVGALIB is not set
# CONFIG_FB_MACMODES is not set
# CONFIG_FB_BACKLIGHT is not set
CONFIG_FB_MODE_HELPERS=y
# CONFIG_FB_TILEBLITTING is not set

#
# Frame buffer hardware drivers
#
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_ASILIANT is not set
# CONFIG_FB_IMSTT is not set
# CONFIG_FB_VGA16 is not set
# CONFIG_FB_VESA is not set
# CONFIG_FB_N411 is not set
# CONFIG_FB_HGA is not set
# CONFIG_FB_S1D13XXX is not set
# CONFIG_FB_NVIDIA is not set
# CONFIG_FB_RIVA is not set
# CONFIG_FB_LE80578 is not set
# CONFIG_FB_MATROX is not set
# CONFIG_FB_RADEON is not set
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
# CONFIG_FB_S3 is not set
# CONFIG_FB_SAVAGE is not set
# CONFIG_FB_SIS is not set
# CONFIG_FB_VIA is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_VT8623 is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_ARK is not set
# CONFIG_FB_PM3 is not set
# CONFIG_FB_CARMINE is not set
# CONFIG_FB_GEODE is not set
# CONFIG_FB_VIRTUAL is not set
# CONFIG_FB_METRONOME is not set
# CONFIG_FB_MB862XX is not set
# CONFIG_FB_BROADSHEET is not set
CONFIG_BACKLIGHT_LCD_SUPPORT=y
# CONFIG_LCD_CLASS_DEVICE is not set
CONFIG_BACKLIGHT_CLASS_DEVICE=y
# CONFIG_BACKLIGHT_GENERIC is not set
# CONFIG_BACKLIGHT_PROGEAR is not set
# CONFIG_BACKLIGHT_MBP_NVIDIA is not set
# CONFIG_BACKLIGHT_SAHARA is not set

#
# Display device support
#
# CONFIG_DISPLAY_SUPPORT is not set

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_VGACON_SOFT_SCROLLBACK=y
CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=64
CONFIG_DUMMY_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE=y
# CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY is not set
# CONFIG_FRAMEBUFFER_CONSOLE_ROTATION is not set
# CONFIG_FONTS is not set
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y
CONFIG_LOGO=y
CONFIG_LOGO_LINUX_MONO=y
CONFIG_LOGO_LINUX_VGA16=y
CONFIG_LOGO_LINUX_CLUT224=y
CONFIG_SOUND=y
CONFIG_SOUND_OSS_CORE=y
# CONFIG_SOUND_OSS_CORE_PRECLAIM is not set
CONFIG_SND=y
CONFIG_SND_TIMER=y
CONFIG_SND_PCM=y
CONFIG_SND_SEQUENCER=m
CONFIG_SND_SEQ_DUMMY=m
CONFIG_SND_OSSEMUL=y
CONFIG_SND_MIXER_OSS=m
CONFIG_SND_PCM_OSS=m
CONFIG_SND_PCM_OSS_PLUGINS=y
CONFIG_SND_SEQUENCER_OSS=y
CONFIG_SND_HRTIMER=m
CONFIG_SND_SEQ_HRTIMER_DEFAULT=y
# CONFIG_SND_DYNAMIC_MINORS is not set
# CONFIG_SND_SUPPORT_OLD_API is not set
# CONFIG_SND_VERBOSE_PROCFS is not set
# CONFIG_SND_VERBOSE_PRINTK is not set
# CONFIG_SND_DEBUG is not set
CONFIG_SND_VMASTER=y
CONFIG_SND_DMA_SGBUF=y
# CONFIG_SND_RAWMIDI_SEQ is not set
# CONFIG_SND_OPL3_LIB_SEQ is not set
# CONFIG_SND_OPL4_LIB_SEQ is not set
# CONFIG_SND_SBAWE_SEQ is not set
# CONFIG_SND_EMU10K1_SEQ is not set
CONFIG_SND_AC97_CODEC=m
# CONFIG_SND_DRIVERS is not set
CONFIG_SND_PCI=y
# CONFIG_SND_AD1889 is not set
# CONFIG_SND_ALS300 is not set
# CONFIG_SND_ALS4000 is not set
# CONFIG_SND_ALI5451 is not set
CONFIG_SND_ATIIXP=m
# CONFIG_SND_ATIIXP_MODEM is not set
# CONFIG_SND_AU8810 is not set
# CONFIG_SND_AU8820 is not set
# CONFIG_SND_AU8830 is not set
# CONFIG_SND_AW2 is not set
# CONFIG_SND_AZT3328 is not set
# CONFIG_SND_BT87X is not set
# CONFIG_SND_CA0106 is not set
# CONFIG_SND_CMIPCI is not set
# CONFIG_SND_OXYGEN is not set
# CONFIG_SND_CS4281 is not set
# CONFIG_SND_CS46XX is not set
# CONFIG_SND_CS5530 is not set
# CONFIG_SND_CS5535AUDIO is not set
# CONFIG_SND_CTXFI is not set
# CONFIG_SND_DARLA20 is not set
# CONFIG_SND_GINA20 is not set
# CONFIG_SND_LAYLA20 is not set
# CONFIG_SND_DARLA24 is not set
# CONFIG_SND_GINA24 is not set
# CONFIG_SND_LAYLA24 is not set
# CONFIG_SND_MONA is not set
# CONFIG_SND_MIA is not set
# CONFIG_SND_ECHO3G is not set
# CONFIG_SND_INDIGO is not set
# CONFIG_SND_INDIGOIO is not set
# CONFIG_SND_INDIGODJ is not set
# CONFIG_SND_INDIGOIOX is not set
# CONFIG_SND_INDIGODJX is not set
# CONFIG_SND_EMU10K1 is not set
# CONFIG_SND_EMU10K1X is not set
# CONFIG_SND_ENS1370 is not set
# CONFIG_SND_ENS1371 is not set
# CONFIG_SND_ES1938 is not set
# CONFIG_SND_ES1968 is not set
# CONFIG_SND_FM801 is not set
CONFIG_SND_HDA_INTEL=y
# CONFIG_SND_HDA_HWDEP is not set
# CONFIG_SND_HDA_INPUT_BEEP is not set
# CONFIG_SND_HDA_INPUT_JACK is not set
# CONFIG_SND_HDA_PATCH_LOADER is not set
CONFIG_SND_HDA_CODEC_REALTEK=y
CONFIG_SND_HDA_CODEC_ANALOG=y
CONFIG_SND_HDA_CODEC_SIGMATEL=y
CONFIG_SND_HDA_CODEC_VIA=y
CONFIG_SND_HDA_CODEC_ATIHDMI=y
CONFIG_SND_HDA_CODEC_NVHDMI=y
CONFIG_SND_HDA_CODEC_INTELHDMI=y
CONFIG_SND_HDA_ELD=y
CONFIG_SND_HDA_CODEC_CIRRUS=y
CONFIG_SND_HDA_CODEC_CONEXANT=y
CONFIG_SND_HDA_CODEC_CA0110=y
CONFIG_SND_HDA_CODEC_CMEDIA=y
CONFIG_SND_HDA_CODEC_SI3054=y
CONFIG_SND_HDA_GENERIC=y
CONFIG_SND_HDA_POWER_SAVE=y
CONFIG_SND_HDA_POWER_SAVE_DEFAULT=0
# CONFIG_SND_HDSP is not set
# CONFIG_SND_HDSPM is not set
# CONFIG_SND_HIFIER is not set
# CONFIG_SND_ICE1712 is not set
# CONFIG_SND_ICE1724 is not set
CONFIG_SND_INTEL8X0=m
CONFIG_SND_INTEL8X0M=m
# CONFIG_SND_KORG1212 is not set
# CONFIG_SND_LX6464ES is not set
# CONFIG_SND_MAESTRO3 is not set
# CONFIG_SND_MIXART is not set
# CONFIG_SND_NM256 is not set
# CONFIG_SND_PCXHR is not set
# CONFIG_SND_RIPTIDE is not set
# CONFIG_SND_RME32 is not set
# CONFIG_SND_RME96 is not set
# CONFIG_SND_RME9652 is not set
# CONFIG_SND_SONICVIBES is not set
# CONFIG_SND_TRIDENT is not set
# CONFIG_SND_VIA82XX is not set
# CONFIG_SND_VIA82XX_MODEM is not set
# CONFIG_SND_VIRTUOSO is not set
# CONFIG_SND_VX222 is not set
# CONFIG_SND_YMFPCI is not set
# CONFIG_SND_USB is not set
# CONFIG_SND_SOC is not set
# CONFIG_SOUND_PRIME is not set
CONFIG_AC97_BUS=m
CONFIG_HID_SUPPORT=y
CONFIG_HID=y
# CONFIG_HIDRAW is not set

#
# USB Input Devices
#
CONFIG_USB_HID=y
# CONFIG_HID_PID is not set
# CONFIG_USB_HIDDEV is not set

#
# Special HID drivers
#
# CONFIG_HID_3M_PCT is not set
CONFIG_HID_A4TECH=y
CONFIG_HID_APPLE=y
CONFIG_HID_BELKIN=y
CONFIG_HID_CHERRY=y
CONFIG_HID_CHICONY=y
CONFIG_HID_CYPRESS=y
CONFIG_HID_DRAGONRISE=y
# CONFIG_DRAGONRISE_FF is not set
CONFIG_HID_EZKEY=y
CONFIG_HID_KYE=y
CONFIG_HID_GYRATION=y
CONFIG_HID_TWINHAN=y
CONFIG_HID_KENSINGTON=y
CONFIG_HID_LOGITECH=y
# CONFIG_LOGITECH_FF is not set
# CONFIG_LOGIRUMBLEPAD2_FF is not set
# CONFIG_LOGIG940_FF is not set
CONFIG_HID_MICROSOFT=y
# CONFIG_HID_MOSART is not set
CONFIG_HID_MONTEREY=y
CONFIG_HID_NTRIG=y
CONFIG_HID_ORTEK=y
CONFIG_HID_PANTHERLORD=y
# CONFIG_PANTHERLORD_FF is not set
CONFIG_HID_PETALYNX=y
# CONFIG_HID_QUANTA is not set
CONFIG_HID_SAMSUNG=y
CONFIG_HID_SONY=y
# CONFIG_HID_STANTUM is not set
CONFIG_HID_SUNPLUS=y
CONFIG_HID_GREENASIA=y
# CONFIG_GREENASIA_FF is not set
CONFIG_HID_SMARTJOYPLUS=y
# CONFIG_SMARTJOYPLUS_FF is not set
CONFIG_HID_TOPSEED=y
CONFIG_HID_THRUSTMASTER=y
# CONFIG_THRUSTMASTER_FF is not set
CONFIG_HID_ZEROPLUS=y
# CONFIG_ZEROPLUS_FF is not set
CONFIG_USB_SUPPORT=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB=y
# CONFIG_USB_DEBUG is not set
CONFIG_USB_ANNOUNCE_NEW_DEVICES=y

#
# Miscellaneous USB options
#
CONFIG_USB_DEVICEFS=y
# CONFIG_USB_DEVICE_CLASS is not set
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_OTG is not set
# CONFIG_USB_MON is not set
# CONFIG_USB_WUSB is not set
# CONFIG_USB_WUSB_CBAF is not set

#
# USB Host Controller Drivers
#
# CONFIG_USB_C67X00_HCD is not set
# CONFIG_USB_XHCI_HCD is not set
CONFIG_USB_EHCI_HCD=y
CONFIG_USB_EHCI_ROOT_HUB_TT=y
# CONFIG_USB_EHCI_TT_NEWSCHED is not set
# CONFIG_USB_OXU210HP_HCD is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_ISP1760_HCD is not set
# CONFIG_USB_ISP1362_HCD is not set
CONFIG_USB_OHCI_HCD=m
# CONFIG_USB_OHCI_BIG_ENDIAN_DESC is not set
# CONFIG_USB_OHCI_BIG_ENDIAN_MMIO is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
# CONFIG_USB_UHCI_HCD is not set
# CONFIG_USB_SL811_HCD is not set
# CONFIG_USB_R8A66597_HCD is not set
# CONFIG_USB_WHCI_HCD is not set
# CONFIG_USB_HWA_HCD is not set

#
# USB Device Class drivers
#
# CONFIG_USB_ACM is not set
CONFIG_USB_PRINTER=m
# CONFIG_USB_WDM is not set
# CONFIG_USB_TMC is not set

#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may
#

#
# also be needed; see USB_STORAGE Help for more info
#
CONFIG_USB_STORAGE=y
# CONFIG_USB_STORAGE_DEBUG is not set
# CONFIG_USB_STORAGE_DATAFAB is not set
# CONFIG_USB_STORAGE_FREECOM is not set
# CONFIG_USB_STORAGE_ISD200 is not set
# CONFIG_USB_STORAGE_USBAT is not set
# CONFIG_USB_STORAGE_SDDR09 is not set
# CONFIG_USB_STORAGE_SDDR55 is not set
# CONFIG_USB_STORAGE_JUMPSHOT is not set
# CONFIG_USB_STORAGE_ALAUDA is not set
# CONFIG_USB_STORAGE_ONETOUCH is not set
# CONFIG_USB_STORAGE_KARMA is not set
# CONFIG_USB_STORAGE_CYPRESS_ATACB is not set
# CONFIG_USB_LIBUSUAL is not set

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set

#
# USB port drivers
#
# CONFIG_USB_SERIAL is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_SEVSEG is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_LED is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_FTDI_ELAN is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set
# CONFIG_USB_IOWARRIOR is not set
# CONFIG_USB_TEST is not set
# CONFIG_USB_ISIGHTFW is not set
# CONFIG_USB_GADGET is not set

#
# OTG and related infrastructure
#
# CONFIG_NOP_USB_XCEIV is not set
# CONFIG_UWB is not set
# CONFIG_MMC is not set
# CONFIG_MEMSTICK is not set
# CONFIG_NEW_LEDS is not set
# CONFIG_ACCESSIBILITY is not set
# CONFIG_INFINIBAND is not set
CONFIG_EDAC=y

#
# Reporting subsystems
#
# CONFIG_EDAC_DEBUG is not set
CONFIG_EDAC_DECODE_MCE=y
CONFIG_EDAC_MM_EDAC=m
CONFIG_EDAC_AMD64=m
CONFIG_EDAC_AMD64_ERROR_INJECTION=y
# CONFIG_EDAC_E752X is not set
# CONFIG_EDAC_I82975X is not set
# CONFIG_EDAC_I3000 is not set
# CONFIG_EDAC_I3200 is not set
# CONFIG_EDAC_X38 is not set
# CONFIG_EDAC_I5400 is not set
# CONFIG_EDAC_I5000 is not set
# CONFIG_EDAC_I5100 is not set
CONFIG_RTC_LIB=y
CONFIG_RTC_CLASS=y
CONFIG_RTC_HCTOSYS=y
CONFIG_RTC_HCTOSYS_DEVICE="rtc0"
# CONFIG_RTC_DEBUG is not set

#
# RTC interfaces
#
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
# CONFIG_RTC_INTF_DEV_UIE_EMUL is not set
# CONFIG_RTC_DRV_TEST is not set

#
# I2C RTC drivers
#
# CONFIG_RTC_DRV_DS1307 is not set
# CONFIG_RTC_DRV_DS1374 is not set
# CONFIG_RTC_DRV_DS1672 is not set
# CONFIG_RTC_DRV_MAX6900 is not set
# CONFIG_RTC_DRV_RS5C372 is not set
# CONFIG_RTC_DRV_ISL1208 is not set
# CONFIG_RTC_DRV_X1205 is not set
# CONFIG_RTC_DRV_PCF8563 is not set
# CONFIG_RTC_DRV_PCF8583 is not set
# CONFIG_RTC_DRV_M41T80 is not set
# CONFIG_RTC_DRV_BQ32K is not set
# CONFIG_RTC_DRV_S35390A is not set
# CONFIG_RTC_DRV_FM3130 is not set
# CONFIG_RTC_DRV_RX8581 is not set
# CONFIG_RTC_DRV_RX8025 is not set

#
# SPI RTC drivers
#

#
# Platform RTC drivers
#
CONFIG_RTC_DRV_CMOS=y
# CONFIG_RTC_DRV_DS1286 is not set
# CONFIG_RTC_DRV_DS1511 is not set
# CONFIG_RTC_DRV_DS1553 is not set
# CONFIG_RTC_DRV_DS1742 is not set
# CONFIG_RTC_DRV_STK17TA8 is not set
# CONFIG_RTC_DRV_M48T86 is not set
# CONFIG_RTC_DRV_M48T35 is not set
# CONFIG_RTC_DRV_M48T59 is not set
# CONFIG_RTC_DRV_MSM6242 is not set
# CONFIG_RTC_DRV_BQ4802 is not set
# CONFIG_RTC_DRV_RP5C01 is not set
# CONFIG_RTC_DRV_V3020 is not set

#
# on-CPU RTC drivers
#
# CONFIG_DMADEVICES is not set
# CONFIG_AUXDISPLAY is not set
# CONFIG_UIO is not set

#
# TI VLYNQ
#
# CONFIG_STAGING is not set
# CONFIG_X86_PLATFORM_DEVICES is not set

#
# Firmware Drivers
#
# CONFIG_EDD is not set
CONFIG_FIRMWARE_MEMMAP=y
# CONFIG_DELL_RBU is not set
# CONFIG_DCDBAS is not set
# CONFIG_DMIID is not set
# CONFIG_ISCSI_IBFT_FIND is not set

#
# File systems
#
CONFIG_EXT2_FS=y
CONFIG_EXT2_FS_XATTR=y
CONFIG_EXT2_FS_POSIX_ACL=y
CONFIG_EXT2_FS_SECURITY=y
# CONFIG_EXT2_FS_XIP is not set
CONFIG_EXT3_FS=y
# CONFIG_EXT3_DEFAULTS_TO_ORDERED is not set
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
# CONFIG_EXT4_FS is not set
CONFIG_JBD=y
# CONFIG_JBD_DEBUG is not set
CONFIG_FS_MBCACHE=y
# CONFIG_REISERFS_FS is not set
# CONFIG_JFS_FS is not set
CONFIG_FS_POSIX_ACL=y
# CONFIG_XFS_FS is not set
# CONFIG_GFS2_FS is not set
# CONFIG_OCFS2_FS is not set
# CONFIG_BTRFS_FS is not set
# CONFIG_NILFS2_FS is not set
CONFIG_FILE_LOCKING=y
CONFIG_FSNOTIFY=y
# CONFIG_DNOTIFY is not set
CONFIG_INOTIFY=y
CONFIG_INOTIFY_USER=y
# CONFIG_QUOTA is not set
# CONFIG_AUTOFS_FS is not set
# CONFIG_AUTOFS4_FS is not set
CONFIG_FUSE_FS=m
# CONFIG_CUSE is not set

#
# Caches
#
# CONFIG_FSCACHE is not set

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=m
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
CONFIG_UDF_FS=m
CONFIG_UDF_NLS=y

#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=m
CONFIG_MSDOS_FS=m
CONFIG_VFAT_FS=m
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-15"
CONFIG_NTFS_FS=m
# CONFIG_NTFS_DEBUG is not set
CONFIG_NTFS_RW=y

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
# CONFIG_TMPFS_POSIX_ACL is not set
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
CONFIG_CONFIGFS_FS=y
# CONFIG_MISC_FILESYSTEMS is not set
CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_NFS_FS=m
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
CONFIG_NFS_V4=y
# CONFIG_NFS_V4_1 is not set
CONFIG_NFSD=m
CONFIG_NFSD_V2_ACL=y
CONFIG_NFSD_V3=y
CONFIG_NFSD_V3_ACL=y
CONFIG_NFSD_V4=y
CONFIG_LOCKD=m
CONFIG_LOCKD_V4=y
CONFIG_EXPORTFS=m
CONFIG_NFS_ACL_SUPPORT=m
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=m
CONFIG_SUNRPC_GSS=m
CONFIG_RPCSEC_GSS_KRB5=m
# CONFIG_RPCSEC_GSS_SPKM3 is not set
# CONFIG_SMB_FS is not set
# CONFIG_CEPH_FS is not set
CONFIG_CIFS=m
# CONFIG_CIFS_STATS is not set
# CONFIG_CIFS_WEAK_PW_HASH is not set
# CONFIG_CIFS_UPCALL is not set
CONFIG_CIFS_XATTR=y
CONFIG_CIFS_POSIX=y
# CONFIG_CIFS_DEBUG2 is not set
# CONFIG_CIFS_DFS_UPCALL is not set
# CONFIG_CIFS_EXPERIMENTAL is not set
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
# CONFIG_ACORN_PARTITION is not set
# CONFIG_OSF_PARTITION is not set
# CONFIG_AMIGA_PARTITION is not set
# CONFIG_ATARI_PARTITION is not set
CONFIG_MAC_PARTITION=y
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
# CONFIG_MINIX_SUBPARTITION is not set
CONFIG_SOLARIS_X86_PARTITION=y
# CONFIG_UNIXWARE_DISKLABEL is not set
CONFIG_LDM_PARTITION=y
# CONFIG_LDM_DEBUG is not set
# CONFIG_SGI_PARTITION is not set
# CONFIG_ULTRIX_PARTITION is not set
CONFIG_SUN_PARTITION=y
# CONFIG_KARMA_PARTITION is not set
# CONFIG_EFI_PARTITION is not set
# CONFIG_SYSV68_PARTITION is not set
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="iso8859-15"
CONFIG_NLS_CODEPAGE_437=m
CONFIG_NLS_CODEPAGE_737=m
CONFIG_NLS_CODEPAGE_775=m
CONFIG_NLS_CODEPAGE_850=m
CONFIG_NLS_CODEPAGE_852=m
CONFIG_NLS_CODEPAGE_855=m
CONFIG_NLS_CODEPAGE_857=m
CONFIG_NLS_CODEPAGE_860=m
CONFIG_NLS_CODEPAGE_861=m
CONFIG_NLS_CODEPAGE_862=m
CONFIG_NLS_CODEPAGE_863=m
CONFIG_NLS_CODEPAGE_864=m
CONFIG_NLS_CODEPAGE_865=m
CONFIG_NLS_CODEPAGE_866=m
CONFIG_NLS_CODEPAGE_869=m
CONFIG_NLS_CODEPAGE_936=m
CONFIG_NLS_CODEPAGE_950=m
CONFIG_NLS_CODEPAGE_932=m
CONFIG_NLS_CODEPAGE_949=m
CONFIG_NLS_CODEPAGE_874=m
CONFIG_NLS_ISO8859_8=m
CONFIG_NLS_CODEPAGE_1250=m
CONFIG_NLS_CODEPAGE_1251=m
CONFIG_NLS_ASCII=m
CONFIG_NLS_ISO8859_1=m
CONFIG_NLS_ISO8859_2=m
CONFIG_NLS_ISO8859_3=m
CONFIG_NLS_ISO8859_4=m
CONFIG_NLS_ISO8859_5=m
CONFIG_NLS_ISO8859_6=m
CONFIG_NLS_ISO8859_7=m
CONFIG_NLS_ISO8859_9=m
CONFIG_NLS_ISO8859_13=m
CONFIG_NLS_ISO8859_14=m
CONFIG_NLS_ISO8859_15=m
CONFIG_NLS_KOI8_R=m
CONFIG_NLS_KOI8_U=m
CONFIG_NLS_UTF8=m
# CONFIG_DLM is not set

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_PRINTK_TIME=y
CONFIG_ENABLE_WARN_DEPRECATED=y
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_FRAME_WARN=2048
CONFIG_MAGIC_SYSRQ=y
# CONFIG_STRIP_ASM_SYMS is not set
CONFIG_UNUSED_SYMBOLS=y
CONFIG_DEBUG_FS=y
# CONFIG_HEADERS_CHECK is not set
CONFIG_DEBUG_KERNEL=y
# CONFIG_DEBUG_SHIRQ is not set
CONFIG_DETECT_SOFTLOCKUP=y
# CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0
CONFIG_DETECT_HUNG_TASK=y
# CONFIG_BOOTPARAM_HUNG_TASK_PANIC is not set
CONFIG_BOOTPARAM_HUNG_TASK_PANIC_VALUE=0
CONFIG_SCHED_DEBUG=y
# CONFIG_SCHEDSTATS is not set
CONFIG_TIMER_STATS=y
# CONFIG_DEBUG_OBJECTS is not set
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set
# CONFIG_DEBUG_KMEMLEAK is not set
CONFIG_DEBUG_PREEMPT=y
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_RT_MUTEX_TESTER is not set
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=y
# CONFIG_PROVE_RCU is not set
CONFIG_LOCKDEP=y
CONFIG_LOCK_STAT=y
CONFIG_DEBUG_LOCKDEP=y
CONFIG_TRACE_IRQFLAGS=y
CONFIG_DEBUG_SPINLOCK_SLEEP=y
CONFIG_DEBUG_LOCKING_API_SELFTESTS=y
CONFIG_STACKTRACE=y
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_INFO=y
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_VIRTUAL is not set
# CONFIG_DEBUG_WRITECOUNT is not set
CONFIG_DEBUG_MEMORY_INIT=y
# CONFIG_DEBUG_LIST is not set
# CONFIG_DEBUG_SG is not set
# CONFIG_DEBUG_NOTIFIERS is not set
# CONFIG_DEBUG_CREDENTIALS is not set
CONFIG_ARCH_WANT_FRAME_POINTERS=y
CONFIG_FRAME_POINTER=y
# CONFIG_BOOT_PRINTK_DELAY is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_CPU_STALL_DETECTOR is not set
# CONFIG_BACKTRACE_SELF_TEST is not set
# CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
# CONFIG_DEBUG_FORCE_WEAK_PER_CPU is not set
# CONFIG_LKDTM is not set
# CONFIG_FAULT_INJECTION is not set
# CONFIG_LATENCYTOP is not set
# CONFIG_SYSCTL_SYSCALL_CHECK is not set
# CONFIG_DEBUG_PAGEALLOC is not set
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FTRACE_NMI_ENTER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_FP_TEST=y
CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_TRACER_MAX_TRACE=y
CONFIG_RING_BUFFER=y
CONFIG_FTRACE_NMI_ENTER=y
CONFIG_EVENT_TRACING=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_RING_BUFFER_ALLOW_SWAP=y
CONFIG_TRACING=y
CONFIG_GENERIC_TRACER=y
CONFIG_TRACING_SUPPORT=y
CONFIG_FTRACE=y
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
CONFIG_IRQSOFF_TRACER=y
CONFIG_PREEMPT_TRACER=y
CONFIG_SYSPROF_TRACER=y
CONFIG_SCHED_TRACER=y
# CONFIG_FTRACE_SYSCALLS is not set
# CONFIG_BOOT_TRACER is not set
CONFIG_BRANCH_PROFILE_NONE=y
# CONFIG_PROFILE_ANNOTATED_BRANCHES is not set
# CONFIG_PROFILE_ALL_BRANCHES is not set
# CONFIG_KSYM_TRACER is not set
# CONFIG_STACK_TRACER is not set
# CONFIG_KMEMTRACE is not set
# CONFIG_WORKQUEUE_TRACER is not set
# CONFIG_BLK_DEV_IO_TRACE is not set
CONFIG_DYNAMIC_FTRACE=y
CONFIG_FUNCTION_PROFILER=y
CONFIG_FTRACE_MCOUNT_RECORD=y
# CONFIG_FTRACE_STARTUP_TEST is not set
# CONFIG_MMIOTRACE is not set
# CONFIG_RING_BUFFER_BENCHMARK is not set
# CONFIG_PROVIDE_OHCI1394_DMA_INIT is not set
# CONFIG_DYNAMIC_DEBUG is not set
# CONFIG_DMA_API_DEBUG is not set
# CONFIG_SAMPLES is not set
CONFIG_HAVE_ARCH_KGDB=y
# CONFIG_KGDB is not set
CONFIG_HAVE_ARCH_KMEMCHECK=y
# CONFIG_STRICT_DEVMEM is not set
CONFIG_X86_VERBOSE_BOOTUP=y
CONFIG_EARLY_PRINTK=y
# CONFIG_EARLY_PRINTK_DBGP is not set
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_DEBUG_PER_CPU_MAPS is not set
# CONFIG_X86_PTDUMP is not set
# CONFIG_DEBUG_RODATA is not set
# CONFIG_DEBUG_NX_TEST is not set
# CONFIG_IOMMU_DEBUG is not set
# CONFIG_IOMMU_STRESS is not set
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
CONFIG_IO_DELAY_TYPE_0X80=0
CONFIG_IO_DELAY_TYPE_0XED=1
CONFIG_IO_DELAY_TYPE_UDELAY=2
CONFIG_IO_DELAY_TYPE_NONE=3
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEFAULT_IO_DELAY_TYPE=0
# CONFIG_DEBUG_BOOT_PARAMS is not set
# CONFIG_CPA_DEBUG is not set
# CONFIG_OPTIMIZE_INLINING is not set
# CONFIG_DEBUG_STRICT_USER_COPY_CHECKS is not set

#
# Security options
#
CONFIG_KEYS=y
# CONFIG_KEYS_DEBUG_PROC_KEYS is not set
# CONFIG_SECURITY is not set
# CONFIG_SECURITYFS is not set
# CONFIG_DEFAULT_SECURITY_SELINUX is not set
# CONFIG_DEFAULT_SECURITY_SMACK is not set
# CONFIG_DEFAULT_SECURITY_TOMOYO is not set
CONFIG_DEFAULT_SECURITY_DAC=y
CONFIG_DEFAULT_SECURITY=""
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD=m
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_BLKCIPHER=y
CONFIG_CRYPTO_BLKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_PCOMP=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
# CONFIG_CRYPTO_GF128MUL is not set
CONFIG_CRYPTO_NULL=m
# CONFIG_CRYPTO_PCRYPT is not set
CONFIG_CRYPTO_WORKQUEUE=y
CONFIG_CRYPTO_CRYPTD=m
CONFIG_CRYPTO_AUTHENC=m
CONFIG_CRYPTO_TEST=m

#
# Authenticated Encryption with Associated Data
#
# CONFIG_CRYPTO_CCM is not set
# CONFIG_CRYPTO_GCM is not set
# CONFIG_CRYPTO_SEQIV is not set

#
# Block modes
#
CONFIG_CRYPTO_CBC=y
# CONFIG_CRYPTO_CTR is not set
# CONFIG_CRYPTO_CTS is not set
CONFIG_CRYPTO_ECB=m
# CONFIG_CRYPTO_LRW is not set
# CONFIG_CRYPTO_PCBC is not set
# CONFIG_CRYPTO_XTS is not set
CONFIG_CRYPTO_FPU=m

#
# Hash modes
#
CONFIG_CRYPTO_HMAC=y
# CONFIG_CRYPTO_XCBC is not set
# CONFIG_CRYPTO_VMAC is not set

#
# Digest
#
CONFIG_CRYPTO_CRC32C=m
# CONFIG_CRYPTO_CRC32C_INTEL is not set
# CONFIG_CRYPTO_GHASH is not set
CONFIG_CRYPTO_MD4=m
CONFIG_CRYPTO_MD5=y
CONFIG_CRYPTO_MICHAEL_MIC=m
CONFIG_CRYPTO_RMD128=m
CONFIG_CRYPTO_RMD160=m
CONFIG_CRYPTO_RMD256=m
CONFIG_CRYPTO_RMD320=m
CONFIG_CRYPTO_SHA1=m
CONFIG_CRYPTO_SHA256=m
CONFIG_CRYPTO_SHA512=m
CONFIG_CRYPTO_TGR192=m
CONFIG_CRYPTO_WP512=m
# CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL is not set

#
# Ciphers
#
CONFIG_CRYPTO_AES=m
CONFIG_CRYPTO_AES_X86_64=m
CONFIG_CRYPTO_AES_NI_INTEL=m
# CONFIG_CRYPTO_ANUBIS is not set
CONFIG_CRYPTO_ARC4=m
CONFIG_CRYPTO_BLOWFISH=m
# CONFIG_CRYPTO_CAMELLIA is not set
CONFIG_CRYPTO_CAST5=m
CONFIG_CRYPTO_CAST6=m
CONFIG_CRYPTO_DES=m
# CONFIG_CRYPTO_FCRYPT is not set
CONFIG_CRYPTO_KHAZAD=m
# CONFIG_CRYPTO_SALSA20 is not set
# CONFIG_CRYPTO_SALSA20_X86_64 is not set
# CONFIG_CRYPTO_SEED is not set
CONFIG_CRYPTO_SERPENT=m
CONFIG_CRYPTO_TEA=m
CONFIG_CRYPTO_TWOFISH=m
CONFIG_CRYPTO_TWOFISH_COMMON=m
CONFIG_CRYPTO_TWOFISH_X86_64=m

#
# Compression
#
CONFIG_CRYPTO_DEFLATE=m
CONFIG_CRYPTO_ZLIB=m
# CONFIG_CRYPTO_LZO is not set

#
# Random Number Generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
# CONFIG_CRYPTO_HW is not set
CONFIG_HAVE_KVM=y
CONFIG_HAVE_KVM_IRQCHIP=y
CONFIG_HAVE_KVM_EVENTFD=y
CONFIG_KVM_APIC_ARCHITECTURE=y
CONFIG_KVM_MMIO=y
CONFIG_VIRTUALIZATION=y
CONFIG_KVM=m
# CONFIG_KVM_INTEL is not set
CONFIG_KVM_AMD=m
# CONFIG_VHOST_NET is not set
# CONFIG_VIRTIO_PCI is not set
# CONFIG_VIRTIO_BALLOON is not set
CONFIG_BINARY_PRINTF=y

#
# Library routines
#
CONFIG_BITREVERSE=y
CONFIG_GENERIC_FIND_FIRST_BIT=y
CONFIG_GENERIC_FIND_NEXT_BIT=y
CONFIG_GENERIC_FIND_LAST_BIT=y
CONFIG_CRC_CCITT=m
CONFIG_CRC16=m
CONFIG_CRC_T10DIF=m
CONFIG_CRC_ITU_T=m
CONFIG_CRC32=y
CONFIG_CRC7=m
CONFIG_LIBCRC32C=m
CONFIG_ZLIB_INFLATE=m
CONFIG_ZLIB_DEFLATE=m
CONFIG_TEXTSEARCH=y
CONFIG_TEXTSEARCH_KMP=m
CONFIG_TEXTSEARCH_BM=m
CONFIG_TEXTSEARCH_FSM=m
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y
CONFIG_NLATTR=y

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 17:04                     ` Rik van Riel
@ 2010-04-06 18:28                       ` Linus Torvalds
  2010-04-06 19:03                         ` Andrew Morton
  2010-04-07  8:36                         ` Peter Zijlstra
  0 siblings, 2 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-06 18:28 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Minchan Kim, KOSAKI Motohiro, Borislav Petkov, Andrew Morton,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins



On Tue, 6 Apr 2010, Rik van Riel wrote:
> 
> Which other cases? When do we ever walk the "same_vma" list
> not from the context of the process owning the vma?

That's the point. What does 'owning the vma' mean? That's exactly what I'm 
asking to be documented.

Quite frankly, the thing is a mess. There is _no_ comment on why it's ok 
to modify the list or walk the list, except for the one totally misleading 
one, since the page_table_lock has at most a _secondary_ meaning in the 
whole ownership (ie it is used only when we do _not_ own the vma chain 
exclusively).

So your very comment shows the whole confusion. No, we do not "own the 
vma" in all cases. Sometimes we just have a read-lock on it.

> This bug in page_referenced is walking the "same_anon_vma" list,
> which is locked with the anon_vma->lock.

Umm. Wake the hell up, Rik!

It's walking a _corrupt_ same_anon_vma list.  In other words, we _know_ 
that the 'anon_vma_chain' entry is crap. We know that exactly because it 
contains "impossible" values with regard to the list.

And what's the easiest way to get such a corrupt list, considering that 
the locking looks correct for that particular list?

That's right: by having something like anon_vma_clone() do something bad 
when it walks the same avc entries using the 'same_vma' list and creates 
copies of it.

You can't just say "but but but same_anon_vma list is always locked 
properly". Because it doesn't matter if that list is locked properly if 
walking _another_ list doesn't work right.

I really don't understand why you keep on harping on that same_anon_vma 
list. The fact that that was the corrupt list IN ABSOLUTELY NO WAY implies 
that that is the list that caused the corruption.

For example, let's say that the 'anon_vma_chain' list is corrupted. Never 
mind how. So what could happen is that you'd have vma->anon_vma pointing 
to one thing, and one or more entries on the 'vma->anon_vma_chain' list 
pointing to _another_ anon_vma.

What happens then? I have no idea. Maybe nothing bad. But the point is, if 
one avc list is corrupted and you may end up referencing those avc's in 
unexpected cases, how can you trust the other list that is in the same 
data structure?

For example, maybe some list corruption causes us to do that 
"anon_vma_chain_link()" _twice_ on the same avc entry. So we do that 
"list_add_tail(&avc->same_anon_vma, &anon_vma->head);" on an entry that 
already had "same_anon_vma" on one list.

No, I really don't see how that could happen, but my argument is that a 
corrupt list can do odd things. The same entry might end up pointing to 
itself, so that you end up freeing it twice or something.

Just as an example of the kind of code that makes me worry:

	void unlink_anon_vmas(struct vm_area_struct *vma)
	{
	        struct anon_vma_chain *avc, *next;
	                
	        /* Unlink each anon_vma chained to the VMA. */
	        list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
	                anon_vma_unlink(avc);
	                list_del(&avc->same_vma);
	                anon_vma_chain_free(avc);
	        }
	}

Now, think about what happens for the *last* entry in that avc chain. It 
will call that "anon_vma_unlink()" thing, which will delete perhaps the 
last entry in the "same_anon_vma" one, and then it does

	if (empty)
		anon_vma_free(anon_vma);

*before* unlink_anon_vmas() has actually done that

	list_del(&avc->same_vma);

and what we essentially have is a stale anon_vma_chain entry that still 
exists on that same_vma list, and points to an anon_vma that already got 
deleted.

Does it matter? I really can't see that it does. But that's the kind of 
thing that makes me nervous. It makes me _especially_ nervous when the 
whole locking for that anon_vma_chain thing isn't entirely obvious.
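
A minimal userspace sketch of that ordering, for illustration only: it is 
not the kernel code. The list helpers are simplified copies, the structs 
keep only the fields needed here, and the 'freed' flag is a hypothetical 
stand-in for anon_vma_free(). It just shows that, between the unlink and 
the list_del(), the avc is still on the same_vma list while its anon_vma 
is already gone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified userspace copy of the kernel's circular list. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static void list_add_tail(struct list_head *new, struct list_head *head)
{
	struct list_head *prev = head->prev;

	head->prev = new;
	new->next = head;
	new->prev = prev;
	prev->next = new;
}

static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

static bool list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* Toy stand-ins: 'freed' is hypothetical and replaces anon_vma_free(). */
struct anon_vma {
	struct list_head head;
	bool freed;
};

struct anon_vma_chain {
	struct anon_vma *anon_vma;
	struct list_head same_vma;
	struct list_head same_anon_vma;
};

/* Same order of operations as described above: unlink, then free if empty. */
static void anon_vma_unlink(struct anon_vma_chain *avc)
{
	struct anon_vma *anon_vma = avc->anon_vma;

	list_del(&avc->same_anon_vma);
	if (list_empty(&anon_vma->head))
		anon_vma->freed = true;
}

/*
 * Runs one iteration of the unlink loop on the last avc of a chain.
 * Returns true if, before list_del(&avc->same_vma), the avc was still
 * linked on the vma's chain while pointing at a freed anon_vma.
 */
static bool stale_window_exists(void)
{
	struct anon_vma av = { .freed = false };
	struct anon_vma_chain avc = { .anon_vma = &av };
	struct list_head vma_chain;	/* plays vma->anon_vma_chain */
	bool stale;

	init_list_head(&av.head);
	init_list_head(&vma_chain);
	list_add_tail(&avc.same_vma, &vma_chain);
	list_add_tail(&avc.same_anon_vma, &av.head);

	anon_vma_unlink(&avc);
	stale = !list_empty(&vma_chain) && avc.anon_vma->freed;

	list_del(&avc.same_vma);
	assert(list_empty(&vma_chain));
	return stale;
}
```

Whether anything can actually observe the avc during that window depends 
on the locking, which is exactly the part that is not documented.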

		Linus

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 18:28                       ` Linus Torvalds
@ 2010-04-06 19:03                         ` Andrew Morton
  2010-04-06 19:10                           ` Steinar H. Gunderson
                                             ` (2 more replies)
  2010-04-07  8:36                         ` Peter Zijlstra
  1 sibling, 3 replies; 242+ messages in thread
From: Andrew Morton @ 2010-04-06 19:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Minchan Kim, KOSAKI Motohiro, Borislav Petkov,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson

On Tue, 6 Apr 2010 11:28:52 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> For example, maybe some list corruption causes us to do that 
> "anon_vma_chain_link()" _twice_ on the same avc entry. So we do that 
> "list_add_tail(&avc->same_anon_vma, &anon_vma->head);" on an entry that 
> already had "same_anon_vma" on one list.

The lib/list_debug.c stuff might detect such things.  I wonder if
either Borislav or Steinar had CONFIG_DEBUG_LIST enabled?

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 19:03                         ` Andrew Morton
  2010-04-06 19:10                           ` Steinar H. Gunderson
@ 2010-04-06 19:10                           ` Linus Torvalds
  2010-04-06 19:35                             ` Linus Torvalds
  2010-04-06 19:42                           ` Borislav Petkov
  2 siblings, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-06 19:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Minchan Kim, KOSAKI Motohiro, Borislav Petkov,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson



On Tue, 6 Apr 2010, Andrew Morton wrote:

> On Tue, 6 Apr 2010 11:28:52 -0700 (PDT)
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > For example, maybe some list corruption causes us to do that 
> > "anon_vma_chain_link()" _twice_ on the same avc entry. So we do that 
> > "list_add_tail(&avc->same_anon_vma, &anon_vma->head);" on an entry that 
> > already had "same_anon_vma" on one list.
> 
> The lib/list_debug.c stuff might detect such things.  I wonder if
> either Borislav or Steinar had CONFIG_DEBUG_LIST enabled?

Well, even without CONFIG_DEBUG_LIST we'd catch _some_ things, and 
conversely, even with DEBUG_LIST on we don't catch everything.

For example, doing list_del() twice on the same entry will die with a 
really nice pattern due to poisoning even without DEBUG_LIST.

But list_add() twice on the same entry will sadly silently succeed both 
with and without list debugging (the list debugging will check the target 
list head, but there is no way to check the "new->next/prev" entries).
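
A minimal userspace sketch of the double list_add() case, assuming a 
simplified copy of the list helpers (this is not the kernel's list.h, and 
first_list_still_closed_after_double_add() is a hypothetical name used 
only here). The second add only touches the target list and the new 
entry, so nothing notices that the entry was already linked elsewhere; 
the first list is simply left pointing into the second one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified userspace copy of the kernel's circular list. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* Same pointer updates as list_add_tail(): insert 'new' before 'head'. */
static void list_add_tail(struct list_head *new, struct list_head *head)
{
	struct list_head *prev = head->prev;

	head->prev = new;
	new->next = head;
	new->prev = prev;
	prev->next = new;
}

/* Count entries reachable forward from 'head', giving up at 'limit'. */
static size_t list_count(const struct list_head *head, size_t limit)
{
	const struct list_head *p;
	size_t n = 0;

	for (p = head->next; p != head && n < limit; p = p->next)
		n++;
	return n;
}

/*
 * Link one node onto list 'a', then onto list 'b' without removing it
 * from 'a' first. Returns true if a forward walk from 'a' still gets
 * back to 'a' within a small bound.
 */
static bool first_list_still_closed_after_double_add(void)
{
	struct list_head a, b, node;

	init_list_head(&a);
	init_list_head(&b);

	list_add_tail(&node, &a);
	assert(list_count(&a, 16) == 1);

	/* Nothing here looks at node.next/node.prev, so this "succeeds". */
	list_add_tail(&node, &b);

	/* a.next is still &node, but node.next is now &b, not &a. */
	return list_count(&a, 16) < 16;
}
```

In this model the walk from 'a' goes a -> node -> b -> node -> ... and 
never returns to 'a', which is the kind of silent damage a later list 
walker would trip over.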

Anyway, I've not actually found anything wrong in the same_vma locking. 
And I'm not at all convinced there is any list corruption there. My point 
was really only that
 (a) the locking rules seem very unclear and certainly not documented and 
 (b) corruption of one list could easily be the cause of corruption of 
     another list of the same structure.
but I don't actually see anything wrong anywhere.

			Linus

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 19:03                         ` Andrew Morton
@ 2010-04-06 19:10                           ` Steinar H. Gunderson
  2010-04-06 19:10                           ` Linus Torvalds
  2010-04-06 19:42                           ` Borislav Petkov
  2 siblings, 0 replies; 242+ messages in thread
From: Steinar H. Gunderson @ 2010-04-06 19:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Borislav Petkov, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins

On Tue, Apr 06, 2010 at 12:03:15PM -0700, Andrew Morton wrote:
>> For example, maybe some list corruption causes us to do that 
>> "anon_vma_chain_link()" _twice_ on the same avc entry. So we do that 
>> "list_add_tail(&avc->same_anon_vma, &anon_vma->head);" on an entry that 
>> already had "same_anon_vma" on one list.
> The lib/list_debug.c stuff might detect such things.  I wonder if
> either Borislav or Steinar had CONFIG_DEBUG_LIST enabled?

Not set on my kernel.

/* Steinar */
-- 
Homepage: http://www.sesse.net/

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 19:10                           ` Linus Torvalds
@ 2010-04-06 19:35                             ` Linus Torvalds
  0 siblings, 0 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-06 19:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Minchan Kim, KOSAKI Motohiro, Borislav Petkov,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson



On Tue, 6 Apr 2010, Linus Torvalds wrote:
> 
> Anyway, I've not actually found anything wrong in the same_vma locking. 
> And I'm not at all convinced there is any list corruption there. My point 
> was really only that
>  (a) the locking rules seem very unclear and certainly not documented and 
>  (b) corruption of one list could easily be the cause of corruption of 
>      another list of the same structure.
> but I don't actually see anything wrong anywhere.

I _have_ found what looks like a few clues, though.

In particular, the disassembly in Steinar Gunderson's case looks much more 
like the disassembly I get, and if I read that correctly, it's actually 
the _first_ iteration of the for_each_entry() loop that crashes.

Why do I think so?

In Steinar's oops, we have "RAX: ffff880169111fc8", which is clearly a 
kernel pointer. However, the code from Steinar's oops decodes to:

   0:	3b 56 10             	cmp    0x10(%rsi),%edx
   3:	73 1e                	jae    0x23
   5:	48 83 fa f2          	cmp    $0xfffffffffffffff2,%rdx
   9:	74 18                	je     0x23
   b:	4d 89 f8             	mov    %r15,%r8
   e:	48 8d 4d cc          	lea    -0x34(%rbp),%rcx
  12:	4c 89 e7             	mov    %r12,%rdi
  15:	e8 44 f2 ff ff       	callq  0xfffffffffffff25e
  1a:	41 01 c5             	add    %eax,%r13d
  1d:	83 7d cc 00          	cmpl   $0x0,-0x34(%rbp)
  21:	74 19                	je     0x3c
  23:	48 8b 43 20          	mov    0x20(%rbx),%rax
  27:	48 8d 58 e0          	lea    -0x20(%rax),%rbx
  2b:*	48 8b 43 20          	mov    0x20(%rbx),%rax     <-- trapping instruction
  2f:	0f 18 08             	prefetcht0 (%rax)
  32:	48 8d 43 20          	lea    0x20(%rbx),%rax
  36:	48 39 45 88          	cmp    %rax,-0x78(%rbp)
  3a:	75 a7                	jne    0xffffffffffffffe3
  3c:	41 fe 06             	incb   (%r14)
  3f:	e9                   	.byte 0xe9

which matches my code pretty well, and the point is, _if_ it went through 
the loop, then %rbx should be %rax-0x20. And it's not.

IOW, the code you see above before the trapping instruction is the end of 
the loop: it's the

		referenced += page_referenced_one(page, vma, address,
				&mapcount, vm_flags);
		if (!mapcount)
			break;
	}

part (the "callq" and "add %eax" is that "referenced +=", and %r13d is 
"referenced").

What you cannot see from the code decode is the loop setup and _entry_, 
which looks like this for me:

        movl    12(%rbx), %eax  # <variable>.D.11299._mapcount.counter, D.33294
        xorl    %r12d, %r12d    # referenced
        incl    %eax    # tmp89
        movl    %eax, -52(%rbp) # tmp89, mapcount
        leaq    48(%r14), %rax  #,
        movq    48(%r14), %r13  # <variable>.head.next, <variable>.head.next
        movq    %rax, -128(%rbp)        #, %sfp
        subq    $32, %r13       #, avc
        jmp     .L167   #

where that "L167" is actually the oopsing instruction (ie the "while" loop 
has been turned around, and we jump to the end of the loop that does the 
loop end test).

In other words, what is NULL here is not an anon_vma_chain entry, but  
actually the initial "anon_vma->head.next" pointer.

The whole _head_ of the list has never been initialized, in other words.

So we can entirely ignore the 'anon_vma_chain' issues. We need to look at 
the initializations of the 'anon_vma's themselves.
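
The pointer arithmetic behind that conclusion can be sketched in 
userspace. This assumes the anon_vma_chain field order of this kernel 
(two pointers, then same_vma, then same_anon_vma) on a 64-bit build, 
which puts same_anon_vma at offset 0x20 as in the disassembly above; the 
pointer fields are simplified to void *, and the helper names are 
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Field order modelled on anon_vma_chain; pointer types simplified. */
struct anon_vma_chain {
	void *vma;
	void *anon_vma;
	struct list_head same_vma;
	struct list_head same_anon_vma;
};

/*
 * What list_entry(link, struct anon_vma_chain, same_anon_vma) computes,
 * done on integer addresses so that a NULL link can be fed in safely.
 */
static uintptr_t avc_from_link(uintptr_t link_addr)
{
	return link_addr - offsetof(struct anon_vma_chain, same_anon_vma);
}

/* Address the loop reads next: &avc->same_anon_vma.next. */
static uintptr_t next_field_addr(uintptr_t avc_addr)
{
	return avc_addr + offsetof(struct anon_vma_chain, same_anon_vma)
			+ offsetof(struct list_head, next);
}

/* With head.next == NULL, the very first load in the loop hits address 0. */
static void check_null_head_walk(void)
{
	uintptr_t avc = avc_from_link(0);

	assert(next_field_addr(avc) == 0);
}
```

So an uninitialized (NULL) anon_vma->head.next gives an 'avc' of 
NULL - 0x20 on the first iteration, and the load of 0x20(avc) is then a 
plain NULL pointer dereference.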

			Linus

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 19:03                         ` Andrew Morton
  2010-04-06 19:10                           ` Steinar H. Gunderson
  2010-04-06 19:10                           ` Linus Torvalds
@ 2010-04-06 19:42                           ` Borislav Petkov
  2010-04-06 20:02                             ` Linus Torvalds
  2 siblings, 1 reply; 242+ messages in thread
From: Borislav Petkov @ 2010-04-06 19:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson

From: Andrew Morton <akpm@linux-foundation.org>
Date: Tue, Apr 06, 2010 at 12:03:15PM -0700

> On Tue, 6 Apr 2010 11:28:52 -0700 (PDT)
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > For example, maybe some list corruption causes us to do that 
> > "anon_vma_chain_link()" _twice_ on the same avc entry. So we do that 
> > "list_add_tail(&avc->same_anon_vma, &anon_vma->head);" on an entry that 
> > already had "same_anon_vma" on one list.
> 
> The lib/list_debug.c stuff might detect such things.  I wonder if
> either Borislav or Steinar had CONFIG_DEBUG_LIST enabled?

No, it is off in my .config. I'll turn it on and retest to see whether
it screams something. In the meantime, I've been testing current git
(v2.6.34-rc3-288-gab195c5), and especially Rik's mem leak fix which
Linus already committed (4946d54cb55e86a156216fcfeed5568514b0830f) and
tried to retrigger the bug by hibernating the machine several times.

Now, this machine has 8G of memory so I thought maybe starting
several assorted guests on it would put some pressure on anon_vma lists
but no, the machine hibernated happily by creating almost a 600MB
hibernation image and having all three guests loaded.

Then, I said, well, let's have another last test run and started firefox
which went into reloading the last session. And I remember that firefox
still hadn't finished loading all pages when I hibernated and boom, it
oopsed.

So, it definitely is some anon_vma lists concurrency issue ... The good
thing is, I was able to catch the oops in its sheer magnificence over
netconsole this time:


[ 2995.478125] PM: Preallocating image memory... 
[ 2995.713692] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 2995.714001] IP: [<ffffffff810c194d>] page_referenced+0xee/0x1dc
[ 2995.714001] PGD 22d1b8067 PUD 22dd85067 PMD 0 
[ 2995.714001] Oops: 0000 [#1] PREEMPT SMP 
[ 2995.714001] last sysfs file: /sys/power/state
[ 2995.714001] CPU 0 
[ 2995.714001] Modules linked in: tun powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd pcspkr 8250_pnp 8250 k10temp edac_core serial_core
[ 2995.714001] 
[ 2995.714001] Pid: 7440, comm: hib.sh Not tainted 2.6.34-rc3-00288-gab195c5 #1 M3A78 PRO/System Product Name
[ 2995.714001] RIP: 0010:[<ffffffff810c194d>]  [<ffffffff810c194d>] page_referenced+0xee/0x1dc
[ 2995.714001] RSP: 0018:ffff88022fa038b8  EFLAGS: 00010283
[ 2995.714001] RAX: ffff88022d747098 RBX: ffffea00078efb70 RCX: 0000000000000000
[ 2995.714001] RDX: ffff88022fa03cf8 RSI: ffff88022d747070 RDI: ffff88022fb32520
[ 2995.714001] RBP: ffff88022fa03938 R08: 0000000000000002 R09: 0000000000000000
[ 2995.714001] R10: ffff88022fa038a8 R11: ffff88022d295d10 R12: 0000000000000000
[ 2995.714001] R13: ffffffffffffffe0 R14: ffff88022d747058 R15: ffff88022fa03a00
[ 2995.714001] FS:  00007f4da8b966f0(0000) GS:ffff88000a000000(0000) knlGS:0000000000000000
[ 2995.714001] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2995.714001] CR2: 0000000000000000 CR3: 000000022d11e000 CR4: 00000000000006f0
[ 2995.714001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2995.714001] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2995.714001] Process hib.sh (pid: 7440, threadinfo ffff88022fa02000, task ffff88022fb32520)
[ 2995.714001] Stack:
[ 2995.714001]  ffff88022d747098 00000000813fd2ac ffffffff8165ee28 0000000000000416
[ 2995.714001] <0> ffff88022fa038f8 ffffffff810c6d40 ffffea00078fae60 ffffea00078fae60
[ 2995.714001] <0> ffff88022fa03938 00000002810abd98 ffffea00078ec530 ffffea00078efb98
[ 2995.714001] Call Trace:
[ 2995.714001]  [<ffffffff810c6d40>] ? swapcache_free+0x37/0x3c
[ 2995.714001]  [<ffffffff810ac31d>] shrink_page_list+0x171/0x4b1
[ 2995.714001]  [<ffffffff813fd1e6>] ? _raw_spin_unlock_irq+0x30/0x58
[ 2995.714001]  [<ffffffff810ac9b9>] shrink_inactive_list+0x35c/0x623
[ 2995.714001]  [<ffffffff810acd94>] ? shrink_zone+0x114/0x3d4
[ 2995.714001]  [<ffffffff81064f29>] ? print_lock_contention_bug+0x1b/0xe1
[ 2995.714001]  [<ffffffff813fc790>] ? _raw_spin_lock_irq+0x19/0x79
[ 2995.714001]  [<ffffffff810acf8a>] shrink_zone+0x30a/0x3d4
[ 2995.714001]  [<ffffffff810ad19e>] ? shrink_slab+0x14a/0x15c
[ 2995.714001]  [<ffffffff810adb65>] do_try_to_free_pages+0x176/0x27f
[ 2995.714001]  [<ffffffff8103de67>] ? irq_exit+0x93/0x95
[ 2995.714001]  [<ffffffff810add03>] shrink_all_memory+0x95/0xc4
[ 2995.714001]  [<ffffffff810ab0f0>] ? isolate_pages_global+0x0/0x217
[ 2995.714001]  [<ffffffff81077503>] ? count_data_pages+0x65/0x79
[ 2995.714001]  [<ffffffff8107776a>] hibernate_preallocate_memory+0x1aa/0x2cb
[ 2995.714001]  [<ffffffff813f95b5>] ? printk+0x41/0x44
[ 2995.714001]  [<ffffffff810760b3>] hibernation_snapshot+0x36/0x1e1
[ 2995.714001]  [<ffffffff8107632c>] hibernate+0xce/0x172
[ 2995.714001]  [<ffffffff81075099>] state_store+0x5c/0xd3
[ 2995.714001]  [<ffffffff8118728f>] kobj_attr_store+0x17/0x19
[ 2995.714001]  [<ffffffff81127b69>] sysfs_write_file+0x108/0x144
[ 2995.714001]  [<ffffffff810d66ff>] vfs_write+0xb2/0x153
[ 2995.714001]  [<ffffffff810641a9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 2995.714001]  [<ffffffff810d6863>] sys_write+0x4a/0x71
[ 2995.714001]  [<ffffffff810021db>] system_call_fastpath+0x16/0x1b
[ 2995.714001] Code: 3b 56 10 73 1e 48 83 fa f2 74 18 48 8d 4d cc 4d 89 f8 48 89 df e8 4d f2 ff ff 41 01 c4 83 7d cc 00 74 19 4d 8b 6d 20 49 83 ed 20 <49> 8b 45 20 0f 18 08 49 8d 45 20 48 39 45 80 75 aa 4c 89 f7 e8 
[ 2995.714001] RIP  [<ffffffff810c194d>] page_referenced+0xee/0x1dc
[ 2995.714001]  RSP <ffff88022fa038b8>
[ 2995.714001] CR2: 0000000000000000
[ 2995.729717] ---[ end trace 92c25d74e4800968 ]---
[ 2995.729862] note: hib.sh[7440] exited with preempt_count 2
[ 2995.730022] BUG: scheduling while atomic: hib.sh/7440/0x10000003
[ 2995.730170] INFO: lockdep is turned off.
[ 2995.730319] Modules linked in: tun powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd pcspkr 8250_pnp 8250 k10temp edac_core serial_core
[ 2995.731749] Pid: 7440, comm: hib.sh Tainted: G      D    2.6.34-rc3-00288-gab195c5 #1
[ 2995.732003] Call Trace:
[ 2995.732158]  [<ffffffff810636bf>] ? __debug_show_held_locks+0x1b/0x24
[ 2995.732305]  [<ffffffff8102d499>] __schedule_bug+0x72/0x77
[ 2995.732454]  [<ffffffff813f9a0a>] schedule+0xd9/0x730
[ 2995.732603]  [<ffffffff81030301>] __cond_resched+0x18/0x24
[ 2995.732751]  [<ffffffff813fa12e>] _cond_resched+0x2c/0x37
[ 2995.732900]  [<ffffffff810b8a21>] unmap_vmas+0x6ce/0x893
[ 2995.733053]  [<ffffffff810bd0f5>] exit_mmap+0xd7/0x182
[ 2995.733206]  [<ffffffff81035b58>] mmput+0x43/0xea
[ 2995.733356]  [<ffffffff81039e99>] exit_mm+0x110/0x11d
[ 2995.733505]  [<ffffffff8103b8ed>] do_exit+0x1c5/0x6a2
[ 2995.733653]  [<ffffffff81038f84>] ? kmsg_dump+0x13b/0x155
[ 2995.733802]  [<ffffffff810060db>] ? oops_end+0x47/0x93
[ 2995.733950]  [<ffffffff81006122>] oops_end+0x8e/0x93
[ 2995.734102]  [<ffffffff8101ed99>] no_context+0x1fc/0x20b
[ 2995.734255]  [<ffffffff8101ef34>] __bad_area_nosemaphore+0x18c/0x1af
[ 2995.734407]  [<ffffffff8101f16f>] ? do_page_fault+0xa8/0x32d
[ 2995.734556]  [<ffffffff8101ef6a>] bad_area_nosemaphore+0x13/0x15
[ 2995.734705]  [<ffffffff8101f23a>] do_page_fault+0x173/0x32d
[ 2995.734854]  [<ffffffff810802f9>] ? __call_rcu+0x11d/0x130
[ 2995.735008]  [<ffffffff813fdaa3>] ? error_sti+0x5/0x6
[ 2995.735161]  [<ffffffff81063167>] ? trace_hardirqs_off_caller+0x1f/0xa9
[ 2995.735313]  [<ffffffff813fc48e>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 2995.735463]  [<ffffffff813fd8bf>] page_fault+0x1f/0x30
[ 2995.735612]  [<ffffffff810c194d>] ? page_referenced+0xee/0x1dc
[ 2995.735761]  [<ffffffff810c18df>] ? page_referenced+0x80/0x1dc
[ 2995.735910]  [<ffffffff810c6d40>] ? swapcache_free+0x37/0x3c
[ 2995.736062]  [<ffffffff810ac31d>] shrink_page_list+0x171/0x4b1
[ 2995.736216]  [<ffffffff813fd1e6>] ? _raw_spin_unlock_irq+0x30/0x58
[ 2995.736368]  [<ffffffff810ac9b9>] shrink_inactive_list+0x35c/0x623
[ 2995.736518]  [<ffffffff810acd94>] ? shrink_zone+0x114/0x3d4
[ 2995.736666]  [<ffffffff81064f29>] ? print_lock_contention_bug+0x1b/0xe1
[ 2995.736816]  [<ffffffff813fc790>] ? _raw_spin_lock_irq+0x19/0x79
[ 2995.736965]  [<ffffffff810acf8a>] shrink_zone+0x30a/0x3d4
[ 2995.737117]  [<ffffffff810ad19e>] ? shrink_slab+0x14a/0x15c
[ 2995.737270]  [<ffffffff810adb65>] do_try_to_free_pages+0x176/0x27f
[ 2995.737422]  [<ffffffff8103de67>] ? irq_exit+0x93/0x95
[ 2995.737570]  [<ffffffff810add03>] shrink_all_memory+0x95/0xc4
[ 2995.737719]  [<ffffffff810ab0f0>] ? isolate_pages_global+0x0/0x217
[ 2995.737868]  [<ffffffff81077503>] ? count_data_pages+0x65/0x79
[ 2995.738020]  [<ffffffff8107776a>] hibernate_preallocate_memory+0x1aa/0x2cb
[ 2995.738175]  [<ffffffff813f95b5>] ? printk+0x41/0x44
[ 2995.738326]  [<ffffffff810760b3>] hibernation_snapshot+0x36/0x1e1
[ 2995.738475]  [<ffffffff8107632c>] hibernate+0xce/0x172
[ 2995.738623]  [<ffffffff81075099>] state_store+0x5c/0xd3
[ 2995.738772]  [<ffffffff8118728f>] kobj_attr_store+0x17/0x19
[ 2995.738920]  [<ffffffff81127b69>] sysfs_write_file+0x108/0x144
[ 2995.739073]  [<ffffffff810d66ff>] vfs_write+0xb2/0x153
[ 2995.739226]  [<ffffffff810641a9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 2995.739378]  [<ffffffff810d6863>] sys_write+0x4a/0x71
[ 2995.739526]  [<ffffffff810021db>] system_call_fastpath+0x16/0x1b
[ 2995.739940] BUG: unable to handle kernel paging request at 00007faf064ff1f0
[ 2995.740220] IP: [<ffffffff8119c0d0>] do_raw_spin_trylock+0x4/0x3a
[ 2995.740441] PGD 0 
[ 2995.740646] Oops: 0000 [#2] PREEMPT SMP 
[ 2995.740685] last sysfs file: /sys/power/state
[ 2995.740685] CPU 1 
[ 2995.740685] Modules linked in: tun powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd pcspkr 8250_pnp 8250 k10temp edac_core serial_core
[ 2995.740685] 
[ 2995.740685] Pid: 7440, comm: hib.sh Tainted: G      D    2.6.34-rc3-00288-gab195c5 #1 M3A78 PRO/System Product Name
[ 2995.740685] RIP: 0010:[<ffffffff8119c0d0>]  [<ffffffff8119c0d0>] do_raw_spin_trylock+0x4/0x3a
[ 2995.740685] RSP: 0018:ffff88022fa03438  EFLAGS: 00010292
[ 2995.740685] RAX: ffff88022fb32520 RBX: 00007faf064ff1f0 RCX: 0000000000000000
[ 2995.740685] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007faf064ff1f0
[ 2995.740685] RBP: ffff88022fa03438 R08: 0000000000000002 R09: 0000000000000000
[ 2995.740685] R10: dead000000100100 R11: ffffffff810d26f5 R12: 00007faf064ff208
[ 2995.740685] R13: fffffffffffffff0 R14: ffff88022d747068 R15: 00007f4da81fa000
[ 2995.740685] FS:  00007f4da8b966f0(0000) GS:ffff88000a200000(0000) knlGS:0000000000000000
[ 2995.740685] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2995.740685] CR2: 00007faf064ff1f0 CR3: 0000000001646000 CR4: 00000000000006e0
[ 2995.740685] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2995.740685] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2995.740685] Process hib.sh (pid: 7440, threadinfo ffff88022fa02000, task ffff88022fb32520)
[ 2995.740685] Stack:
[ 2995.740685]  ffff88022fa03468 ffffffff813fc6c3 ffffffff810c1ae3 ffff8801cbfde880
[ 2995.740685] <0> ffff88022fb32510 00007faf064ff1f0 ffff88022fa034a8 ffffffff810c1ae3
[ 2995.740685] <0> ffff88022fa034a8 ffff88022d747000 0000000000000000 0000000000000000
[ 2995.740685] Call Trace:
[ 2995.740685]  [<ffffffff813fc6c3>] _raw_spin_lock+0x48/0x73
[ 2995.740685]  [<ffffffff810c1ae3>] ? unlink_anon_vmas+0x40/0xe1
[ 2995.740685]  [<ffffffff810c1ae3>] unlink_anon_vmas+0x40/0xe1
[ 2995.740685]  [<ffffffff810bb562>] free_pgtables+0x68/0xce
[ 2995.740685]  [<ffffffff810bd11e>] exit_mmap+0x100/0x182
[ 2995.740685]  [<ffffffff81035b58>] mmput+0x43/0xea
[ 2995.740685]  [<ffffffff81039e99>] exit_mm+0x110/0x11d
[ 2995.740685]  [<ffffffff8103b8ed>] do_exit+0x1c5/0x6a2
[ 2995.740685]  [<ffffffff81038f84>] ? kmsg_dump+0x13b/0x155
[ 2995.740685]  [<ffffffff810060db>] ? oops_end+0x47/0x93
[ 2995.740685]  [<ffffffff81006122>] oops_end+0x8e/0x93
[ 2995.740685]  [<ffffffff8101ed99>] no_context+0x1fc/0x20b
[ 2995.740685]  [<ffffffff8101ef34>] __bad_area_nosemaphore+0x18c/0x1af
[ 2995.740685]  [<ffffffff8101f16f>] ? do_page_fault+0xa8/0x32d
[ 2995.740685]  [<ffffffff8101ef6a>] bad_area_nosemaphore+0x13/0x15
[ 2995.740685]  [<ffffffff8101f23a>] do_page_fault+0x173/0x32d
[ 2995.740685]  [<ffffffff810802f9>] ? __call_rcu+0x11d/0x130
[ 2995.740685]  [<ffffffff813fdaa3>] ? error_sti+0x5/0x6
[ 2995.740685]  [<ffffffff81063167>] ? trace_hardirqs_off_caller+0x1f/0xa9
[ 2995.740685]  [<ffffffff813fc48e>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 2995.740685]  [<ffffffff813fd8bf>] page_fault+0x1f/0x30
[ 2995.740685]  [<ffffffff810c194d>] ? page_referenced+0xee/0x1dc
[ 2995.740685]  [<ffffffff810c18df>] ? page_referenced+0x80/0x1dc
[ 2995.740685]  [<ffffffff810c6d40>] ? swapcache_free+0x37/0x3c
[ 2995.740685]  [<ffffffff810ac31d>] shrink_page_list+0x171/0x4b1
[ 2995.740685]  [<ffffffff813fd1e6>] ? _raw_spin_unlock_irq+0x30/0x58
[ 2995.740685]  [<ffffffff810ac9b9>] shrink_inactive_list+0x35c/0x623
[ 2995.740685]  [<ffffffff810acd94>] ? shrink_zone+0x114/0x3d4
[ 2995.740685]  [<ffffffff81064f29>] ? print_lock_contention_bug+0x1b/0xe1
[ 2995.740685]  [<ffffffff813fc790>] ? _raw_spin_lock_irq+0x19/0x79
[ 2995.740685]  [<ffffffff810acf8a>] shrink_zone+0x30a/0x3d4
[ 2995.740685]  [<ffffffff810ad19e>] ? shrink_slab+0x14a/0x15c
[ 2995.740685]  [<ffffffff810adb65>] do_try_to_free_pages+0x176/0x27f
[ 2995.740685]  [<ffffffff8103de67>] ? irq_exit+0x93/0x95
[ 2995.740685]  [<ffffffff810add03>] shrink_all_memory+0x95/0xc4
[ 2995.740685]  [<ffffffff810ab0f0>] ? isolate_pages_global+0x0/0x217
[ 2995.740685]  [<ffffffff81077503>] ? count_data_pages+0x65/0x79
[ 2995.740685]  [<ffffffff8107776a>] hibernate_preallocate_memory+0x1aa/0x2cb
[ 2995.740685]  [<ffffffff813f95b5>] ? printk+0x41/0x44
[ 2995.740685]  [<ffffffff810760b3>] hibernation_snapshot+0x36/0x1e1
[ 2995.740685]  [<ffffffff8107632c>] hibernate+0xce/0x172
[ 2995.740685]  [<ffffffff81075099>] state_store+0x5c/0xd3
[ 2995.740685]  [<ffffffff8118728f>] kobj_attr_store+0x17/0x19
[ 2995.740685]  [<ffffffff81127b69>] sysfs_write_file+0x108/0x144
[ 2995.740685]  [<ffffffff810d66ff>] vfs_write+0xb2/0x153
[ 2995.740685]  [<ffffffff810641a9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 2995.740685]  [<ffffffff810d6863>] sys_write+0x4a/0x71
[ 2995.740685]  [<ffffffff810021db>] system_call_fastpath+0x16/0x1b
[ 2995.740685] Code: c7 c7 90 16 67 81 e8 79 f1 25 00 48 c7 c7 90 16 67 81 e8 e1 e7 25 00 48 c7 c7 30 18 67 81 e8 d5 e7 25 00 c9 c3 90 90 55 48 89 e5 <0f> b7 07 38 e0 8d 90 00 01 00 00 75 05 f0 66 0f b1 17 0f 94 c2 
[ 2995.740685] RIP  [<ffffffff8119c0d0>] do_raw_spin_trylock+0x4/0x3a
[ 2995.740685]  RSP <ffff88022fa03438>
[ 2995.740685] CR2: 00007faf064ff1f0
[ 2995.762521] ---[ end trace 92c25d74e4800969 ]---
[ 2995.762686] Fixing recursive fault but reboot is needed!
[ 2995.762855] BUG: scheduling while atomic: hib.sh/7440/0x00000005
[ 2995.763026] INFO: lockdep is turned off.
[ 2995.763203] Modules linked in: tun powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd pcspkr 8250_pnp 8250 k10temp edac_core serial_core
[ 2995.764799] Pid: 7440, comm: hib.sh Tainted: G      D    2.6.34-rc3-00288-gab195c5 #1
[ 2995.765080] Call Trace:
[ 2995.765256]  [<ffffffff810636bf>] ? __debug_show_held_locks+0x1b/0x24
[ 2995.765429]  [<ffffffff8102d499>] __schedule_bug+0x72/0x77
[ 2995.765600]  [<ffffffff813f9a0a>] schedule+0xd9/0x730
[ 2995.765771]  [<ffffffff8103b7f7>] do_exit+0xcf/0x6a2
[ 2995.765941]  [<ffffffff81038f84>] ? kmsg_dump+0x13b/0x155
[ 2995.766115]  [<ffffffff810060db>] ? oops_end+0x47/0x93
[ 2995.766295]  [<ffffffff81006122>] oops_end+0x8e/0x93
[ 2995.766462]  [<ffffffff8101ed99>] no_context+0x1fc/0x20b
[ 2995.766632]  [<ffffffff810641a9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 2995.766806]  [<ffffffff8101ef34>] __bad_area_nosemaphore+0x18c/0x1af
[ 2995.766977]  [<ffffffff8101f16f>] ? do_page_fault+0xa8/0x32d
[ 2995.767161]  [<ffffffff8101ef6a>] bad_area_nosemaphore+0x13/0x15
[ 2995.767330]  [<ffffffff8101f23a>] do_page_fault+0x173/0x32d
[ 2995.767501]  [<ffffffff810a9eca>] ? release_pages+0x1ee/0x200
[ 2995.767673]  [<ffffffff813fdaa3>] ? error_sti+0x5/0x6
[ 2995.767842]  [<ffffffff81063167>] ? trace_hardirqs_off_caller+0x1f/0xa9
[ 2995.768017]  [<ffffffff813fc48e>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 2995.768202]  [<ffffffff810d26f5>] ? kmem_cache_free+0x56/0x129
[ 2995.768373]  [<ffffffff813fd8bf>] page_fault+0x1f/0x30
[ 2995.768544]  [<ffffffff810d26f5>] ? kmem_cache_free+0x56/0x129
[ 2995.768716]  [<ffffffff8119c0d0>] ? do_raw_spin_trylock+0x4/0x3a
[ 2995.768888]  [<ffffffff813fc6c3>] _raw_spin_lock+0x48/0x73
[ 2995.769064]  [<ffffffff810c1ae3>] ? unlink_anon_vmas+0x40/0xe1
[ 2995.769246]  [<ffffffff810c1ae3>] unlink_anon_vmas+0x40/0xe1
[ 2995.769415]  [<ffffffff810bb562>] free_pgtables+0x68/0xce
[ 2995.769586]  [<ffffffff810bd11e>] exit_mmap+0x100/0x182
[ 2995.769756]  [<ffffffff81035b58>] mmput+0x43/0xea
[ 2995.769925]  [<ffffffff81039e99>] exit_mm+0x110/0x11d
[ 2995.770099]  [<ffffffff8103b8ed>] do_exit+0x1c5/0x6a2
[ 2995.770279]  [<ffffffff81038f84>] ? kmsg_dump+0x13b/0x155
[ 2995.770447]  [<ffffffff810060db>] ? oops_end+0x47/0x93
[ 2995.770616]  [<ffffffff81006122>] oops_end+0x8e/0x93
[ 2995.770785]  [<ffffffff8101ed99>] no_context+0x1fc/0x20b
[ 2995.770955]  [<ffffffff8101ef34>] __bad_area_nosemaphore+0x18c/0x1af
[ 2995.771141]  [<ffffffff8101f16f>] ? do_page_fault+0xa8/0x32d
[ 2995.771311]  [<ffffffff8101ef6a>] bad_area_nosemaphore+0x13/0x15
[ 2995.771482]  [<ffffffff8101f23a>] do_page_fault+0x173/0x32d
[ 2995.771653]  [<ffffffff810802f9>] ? __call_rcu+0x11d/0x130
[ 2995.771824]  [<ffffffff813fdaa3>] ? error_sti+0x5/0x6
[ 2995.771994]  [<ffffffff81063167>] ? trace_hardirqs_off_caller+0x1f/0xa9
[ 2995.772179]  [<ffffffff813fc48e>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 2995.772352]  [<ffffffff813fd8bf>] page_fault+0x1f/0x30
[ 2995.772524]  [<ffffffff810c194d>] ? page_referenced+0xee/0x1dc
[ 2995.772696]  [<ffffffff810c18df>] ? page_referenced+0x80/0x1dc
[ 2995.772867]  [<ffffffff810c6d40>] ? swapcache_free+0x37/0x3c
[ 2995.773043]  [<ffffffff810ac31d>] shrink_page_list+0x171/0x4b1
[ 2995.773226]  [<ffffffff813fd1e6>] ? _raw_spin_unlock_irq+0x30/0x58
[ 2995.773398]  [<ffffffff810ac9b9>] shrink_inactive_list+0x35c/0x623
[ 2995.773572]  [<ffffffff810acd94>] ? shrink_zone+0x114/0x3d4
[ 2995.773742]  [<ffffffff81064f29>] ? print_lock_contention_bug+0x1b/0xe1
[ 2995.773916]  [<ffffffff813fc790>] ? _raw_spin_lock_irq+0x19/0x79
[ 2995.774093]  [<ffffffff810acf8a>] shrink_zone+0x30a/0x3d4
[ 2995.774274]  [<ffffffff810ad19e>] ? shrink_slab+0x14a/0x15c
[ 2995.774444]  [<ffffffff810adb65>] do_try_to_free_pages+0x176/0x27f
[ 2995.774617]  [<ffffffff8103de67>] ? irq_exit+0x93/0x95
[ 2995.774786]  [<ffffffff810add03>] shrink_all_memory+0x95/0xc4
[ 2995.774958]  [<ffffffff810ab0f0>] ? isolate_pages_global+0x0/0x217
[ 2995.775144]  [<ffffffff81077503>] ? count_data_pages+0x65/0x79
[ 2995.775314]  [<ffffffff8107776a>] hibernate_preallocate_memory+0x1aa/0x2cb
[ 2995.775487]  [<ffffffff813f95b5>] ? printk+0x41/0x44
[ 2995.775657]  [<ffffffff810760b3>] hibernation_snapshot+0x36/0x1e1
[ 2995.775828]  [<ffffffff8107632c>] hibernate+0xce/0x172
[ 2995.775998]  [<ffffffff81075099>] state_store+0x5c/0xd3
[ 2995.776182]  [<ffffffff8118728f>] kobj_attr_store+0x17/0x19
[ 2995.776350]  [<ffffffff81127b69>] sysfs_write_file+0x108/0x144
[ 2995.776521]  [<ffffffff810d66ff>] vfs_write+0xb2/0x153
[ 2995.776690]  [<ffffffff810641a9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 2995.776863]  [<ffffffff810d6863>] sys_write+0x4a/0x71
[ 2995.777038]  [<ffffffff810021db>] system_call_fastpath+0x16/0x1b

-- 
Regards/Gruss,
Boris.

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 19:42                           ` Borislav Petkov
@ 2010-04-06 20:02                             ` Linus Torvalds
  2010-04-06 20:46                               ` Steinar H. Gunderson
                                                 ` (2 more replies)
  0 siblings, 3 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-06 20:02 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson



On Tue, 6 Apr 2010, Borislav Petkov wrote:
> 
> [ 2995.478125] PM: Preallocating image memory... 
> [ 2995.713692] BUG: unable to handle kernel NULL pointer dereference at (null)
> [ 2995.714001] IP: [<ffffffff810c194d>] page_referenced+0xee/0x1dc
> [ 2995.714001] PGD 22d1b8067 PUD 22dd85067 PMD 0 
> [ 2995.714001] Oops: 0000 [#1] PREEMPT SMP 
> [ 2995.714001] last sysfs file: /sys/power/state
> [ 2995.714001] CPU 0 
> [ 2995.714001] Modules linked in: tun powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd pcspkr 8250_pnp 8250 k10temp edac_core serial_core
> [ 2995.714001] 
> [ 2995.714001] Pid: 7440, comm: hib.sh Not tainted 2.6.34-rc3-00288-gab195c5 #1 M3A78 PRO/System Product Name
> [ 2995.714001] RIP: 0010:[<ffffffff810c194d>]  [<ffffffff810c194d>] page_referenced+0xee/0x1dc
> [ 2995.714001] RSP: 0018:ffff88022fa038b8  EFLAGS: 00010283
> [ 2995.714001] RAX: ffff88022d747098 RBX: ffffea00078efb70 RCX: 0000000000000000
> [ 2995.714001] RDX: ffff88022fa03cf8 RSI: ffff88022d747070 RDI: ffff88022fb32520
> [ 2995.714001] RBP: ffff88022fa03938 R08: 0000000000000002 R09: 0000000000000000
> [ 2995.714001] R10: ffff88022fa038a8 R11: ffff88022d295d10 R12: 0000000000000000
> [ 2995.714001] R13: ffffffffffffffe0 R14: ffff88022d747058 R15: ffff88022fa03a00
> [ 2995.714001] FS:  00007f4da8b966f0(0000) GS:ffff88000a000000(0000) knlGS:0000000000000000
> [ 2995.714001] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 2995.714001] CR2: 0000000000000000 CR3: 000000022d11e000 CR4: 00000000000006f0
> [ 2995.714001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 2995.714001] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 2995.714001] Process hib.sh (pid: 7440, threadinfo ffff88022fa02000, task ffff88022fb32520)
> [ 2995.714001] Stack:
> [ 2995.714001]  ffff88022d747098 00000000813fd2ac ffffffff8165ee28 0000000000000416
> [ 2995.714001] <0> ffff88022fa038f8 ffffffff810c6d40 ffffea00078fae60 ffffea00078fae60
> [ 2995.714001] <0> ffff88022fa03938 00000002810abd98 ffffea00078ec530 ffffea00078efb98
> [ 2995.714001] Call Trace:
> [ 2995.714001]  [<ffffffff810c6d40>] ? swapcache_free+0x37/0x3c
> [ 2995.714001]  [<ffffffff810ac31d>] shrink_page_list+0x171/0x4b1
> [ 2995.714001]  [<ffffffff813fd1e6>] ? _raw_spin_unlock_irq+0x30/0x58
> [ 2995.714001]  [<ffffffff810ac9b9>] shrink_inactive_list+0x35c/0x623
> [ 2995.714001]  [<ffffffff810acd94>] ? shrink_zone+0x114/0x3d4
> [ 2995.714001]  [<ffffffff81064f29>] ? print_lock_contention_bug+0x1b/0xe1
> [ 2995.714001]  [<ffffffff813fc790>] ? _raw_spin_lock_irq+0x19/0x79
> [ 2995.714001]  [<ffffffff810acf8a>] shrink_zone+0x30a/0x3d4
> [ 2995.714001]  [<ffffffff810ad19e>] ? shrink_slab+0x14a/0x15c
> [ 2995.714001]  [<ffffffff810adb65>] do_try_to_free_pages+0x176/0x27f
> [ 2995.714001]  [<ffffffff8103de67>] ? irq_exit+0x93/0x95
> [ 2995.714001]  [<ffffffff810add03>] shrink_all_memory+0x95/0xc4
> [ 2995.714001]  [<ffffffff810ab0f0>] ? isolate_pages_global+0x0/0x217
> [ 2995.714001]  [<ffffffff81077503>] ? count_data_pages+0x65/0x79
> [ 2995.714001]  [<ffffffff8107776a>] hibernate_preallocate_memory+0x1aa/0x2cb
> [ 2995.714001]  [<ffffffff813f95b5>] ? printk+0x41/0x44
> [ 2995.714001]  [<ffffffff810760b3>] hibernation_snapshot+0x36/0x1e1
> [ 2995.714001]  [<ffffffff8107632c>] hibernate+0xce/0x172
> [ 2995.714001]  [<ffffffff81075099>] state_store+0x5c/0xd3
> [ 2995.714001]  [<ffffffff8118728f>] kobj_attr_store+0x17/0x19
> [ 2995.714001]  [<ffffffff81127b69>] sysfs_write_file+0x108/0x144
> [ 2995.714001]  [<ffffffff810d66ff>] vfs_write+0xb2/0x153
> [ 2995.714001]  [<ffffffff810641a9>] ? trace_hardirqs_on_caller+0x1f/0x14b
> [ 2995.714001]  [<ffffffff810d6863>] sys_write+0x4a/0x71
> [ 2995.714001]  [<ffffffff810021db>] system_call_fastpath+0x16/0x1b
> [ 2995.714001] Code: 3b 56 10 73 1e 48 83 fa f2 74 18 48 8d 4d cc 4d 89 f8 48 89 df e8 4d f2 ff ff 41 01 c4 83 7d cc 00 74 19 4d 8b 6d 20 49 83 ed 20 <49> 8b 45 20 0f 18 08 49 8d 45 20 48 39 45 80 75 aa 4c 89 f7 e8 
> [ 2995.714001] RIP  [<ffffffff810c194d>] page_referenced+0xee/0x1dc
> [ 2995.714001]  RSP <ffff88022fa038b8>
> [ 2995.714001] CR2: 0000000000000000
> [ 2995.729717] ---[ end trace 92c25d74e4800968 ]---

So again, I can show that the code has never actually been through the 
loop. The above code decodes to:

   0:	3b 56 10             	cmp    0x10(%rsi),%edx
   3:	73 1e                	jae    0x23
   5:	48 83 fa f2          	cmp    $0xfffffffffffffff2,%rdx
   9:	74 18                	je     0x23
   b:	48 8d 4d cc          	lea    -0x34(%rbp),%rcx
   f:	4d 89 f8             	mov    %r15,%r8
  12:	48 89 df             	mov    %rbx,%rdi
  15:	e8 4d f2 ff ff       	callq  0xfffffffffffff267
  1a:	41 01 c4             	add    %eax,%r12d
  1d:	83 7d cc 00          	cmpl   $0x0,-0x34(%rbp)
  21:	74 19                	je     0x3c
  23:	4d 8b 6d 20          	mov    0x20(%r13),%r13
  27:	49 83 ed 20          	sub    $0x20,%r13
  2b:*	49 8b 45 20          	mov    0x20(%r13),%rax     <-- trapping instruction
  2f:	0f 18 08             	prefetcht0 (%rax)
  32:	49 8d 45 20          	lea    0x20(%r13),%rax
  36:	48 39 45 80          	cmp    %rax,-0x80(%rbp)
  3a:	75 aa                	jne    0xffffffffffffffe6
  3c:	4c 89 f7             	mov    %r14,%rdi
  3f:	e8                   	.byte 0xe8

and in your case, if we had gone through the loop, then %rax would still 
contain the return value from page_referenced_one(). 

But %rax is a kernel pointer, and %r12d is 0.

So again, it's actually anon_vma.head.next that is NULL, not any of the 
entries on the list itself.
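
The register dump is consistent with that reading. Here is a minimal
user-space sketch of the arithmetic, assuming a 64-bit build and a stand-in
struct whose list link sits at offset 0x20 (the displacement seen in the
disassembly). The struct and function names are illustrative only, not the
kernel's definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-in for the kernel's doubly linked list node. */
struct list_head {
	struct list_head *next, *prev;
};

/*
 * Hypothetical stand-in for the entry being walked. Only the offset of
 * the last member matters: with 8-byte pointers it lands at 0x20, the
 * displacement used in the disassembly.
 */
struct chain_entry {
	void *vma;
	void *anon_vma;
	struct list_head same_vma;
	struct list_head same_anon_vma;
};

/*
 * What list_entry()/container_of() boils down to, done on integers so
 * that a NULL input stays well defined in user space: subtract the
 * member offset from the address of the embedded list_head.
 */
static uint64_t entry_addr(uint64_t link_addr, uint64_t member_offset)
{
	/* list links are pointer-aligned inside their containing struct */
	assert(member_offset % sizeof(void *) == 0);
	return link_addr - member_offset;
}

/* Address the loop loads next: the entry's own link, i.e. its ->next. */
static uint64_t next_load_addr(uint64_t entry, uint64_t member_offset)
{
	return entry + member_offset;
}
```

With head.next == NULL, entry_addr(0, 0x20) gives 0xffffffffffffffe0, which
is the value in R13, and next_load_addr() of that entry is 0, which is the
faulting address in CR2.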

Now, I can see several cases for this:

 - the obvious one: anon_vma just wasn't correctly initialized, and is 
   missing an INIT_LIST_HEAD(&anon_vma->head). That's either a slab bug (we 
   don't have a whole lot of coverage of constructors), or somebody 
   allocated an anon_vma without using the anon_vma_cachep.

 - Related to the above: perhaps the RCU freeing isn't working, or 
   slub/slab/slob ends up reusing the allocations for something other than 
   anon_vmas, so together with the race _and_ an unlucky re-use, you get 
   some odd crud.

   I haven't looked at the kernel config files: do they perhaps share the 
   same (odd?) SLUB/SLAB/SLOB config?

 - anon_vma isn't actually an anon_vma at all. 'page->mapping' was crud 
   with the low bit set. That sounds unlikely, but who knows. The ksm code 
   sets mapping to "stable_node + (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM)"

   Did people have KSM enabled?

.. and probably other things I haven't even thought about.
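
To make the first case concrete, here is a small user-space sketch of what a
missing constructor call does to an embedded list head. Everything in it
(fake_anon_vma, fake_alloc, the malloc-based "cache") is a hypothetical
stand-in, not the kernel's slab code; it only mirrors the idea of
INIT_LIST_HEAD() and list_empty():

```c
#include <assert.h>
#include <stdlib.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Same idea as the kernel's INIT_LIST_HEAD(): an empty list points at itself. */
static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Same idea as the kernel's list_empty(). */
static int list_is_empty(const struct list_head *head)
{
	return head->next == head;
}

/* Hypothetical object with an embedded list head, standing in for anon_vma. */
struct fake_anon_vma {
	long lock;			/* placeholder for the real lock */
	struct list_head head;
};

/* Constructor, as a slab cache would run it on each new object. */
static void fake_anon_vma_ctor(void *data)
{
	struct fake_anon_vma *av = data;

	av->lock = 0;
	init_list_head(&av->head);
}

/* Allocate an object, optionally skipping the constructor. */
static struct fake_anon_vma *fake_alloc(int run_ctor)
{
	struct fake_anon_vma *av = malloc(sizeof(*av));

	assert(av != NULL);
	/* Model zeroed memory that never saw the constructor. */
	av->lock = 0;
	av->head.next = NULL;
	av->head.prev = NULL;
	if (run_ctor)
		fake_anon_vma_ctor(av);
	return av;
}
```

An object from fake_alloc(1) has an empty list and a walker never enters its
loop body. An object from fake_alloc(0) has head.next == NULL, which does not
compare equal to &head, so a walker treats the list as non-empty and goes on
to dereference a pointer derived from NULL.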

		Linus

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 20:02                             ` Linus Torvalds
@ 2010-04-06 20:46                               ` Steinar H. Gunderson
  2010-04-06 20:56                                 ` Linus Torvalds
  2010-04-06 20:51                               ` Borislav Petkov
  2010-04-07  8:41                               ` Peter Zijlstra
  2 siblings, 1 reply; 242+ messages in thread
From: Steinar H. Gunderson @ 2010-04-06 20:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Andrew Morton, Rik van Riel, Minchan Kim,
	KOSAKI Motohiro, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins

On Tue, Apr 06, 2010 at 01:02:35PM -0700, Linus Torvalds wrote:
>    I haven't looked at the kernel config files: do they perhaps share the 
>    same (odd?) SLUB/SLAB/SLOB config?

http://storage.sesse.net/config-crashing-2.6.34-rc2

>    Did people have KSM enabled?

No KSM for me.

/* Steinar */
-- 
Homepage: http://www.sesse.net/

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 20:02                             ` Linus Torvalds
  2010-04-06 20:46                               ` Steinar H. Gunderson
@ 2010-04-06 20:51                               ` Borislav Petkov
  2010-04-06 21:27                                 ` Linus Torvalds
  2010-04-07  8:41                               ` Peter Zijlstra
  2 siblings, 1 reply; 242+ messages in thread
From: Borislav Petkov @ 2010-04-06 20:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, Apr 06, 2010 at 01:02:35PM -0700

> So again, I can show that the code has never actually been through the 
> loop. The above code decodes to:
> 
>    0:	3b 56 10             	cmp    0x10(%rsi),%edx
>    3:	73 1e                	jae    0x23
>    5:	48 83 fa f2          	cmp    $0xfffffffffffffff2,%rdx
>    9:	74 18                	je     0x23
>    b:	48 8d 4d cc          	lea    -0x34(%rbp),%rcx
>    f:	4d 89 f8             	mov    %r15,%r8
>   12:	48 89 df             	mov    %rbx,%rdi
>   15:	e8 4d f2 ff ff       	callq  0xfffffffffffff267
>   1a:	41 01 c4             	add    %eax,%r12d
>   1d:	83 7d cc 00          	cmpl   $0x0,-0x34(%rbp)
>   21:	74 19                	je     0x3c
>   23:	4d 8b 6d 20          	mov    0x20(%r13),%r13
>   27:	49 83 ed 20          	sub    $0x20,%r13
>   2b:*	49 8b 45 20          	mov    0x20(%r13),%rax     <-- trapping instruction
>   2f:	0f 18 08             	prefetcht0 (%rax)
>   32:	49 8d 45 20          	lea    0x20(%r13),%rax
>   36:	48 39 45 80          	cmp    %rax,-0x80(%rbp)
>   3a:	75 aa                	jne    0xffffffffffffffe6
>   3c:	4c 89 f7             	mov    %r14,%rdi
>   3f:	e8                   	.byte 0xe8
> 
> and in your case, if we had gone through the loop, then %rax would still 
> contain the return value from page_referenced_one(). 
> 
> But %rax is a kernel pointer, and %r12d is 0.
> 
> So again, it's actually anon_vma.head.next that is NULL, not any of the 
> entries on the list itself.
> 
> Now, I can see several cases for this:
> 
>  - the obvious one: anon_vma just wasn't correctly initialized, and is 
> >    missing an INIT_LIST_HEAD(&anon_vma->head). That's either a slab bug (we 
>    don't have a whole lot of coverage of constructors), or somebody 
>    allocated an anon_vma without using the anon_vma_cachep.

I've added code to verify this and am suspend/resuming now... Wait a
minute, Linus, you're good! :) :

[  873.083074] PM: Preallocating image memory... 
[  873.254359] NULL anon_vma->head.next, page 2182681

The number printed is page_to_pfn(page).
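
The debug patch itself isn't shown in this mail. As a rough user-space
sketch of the kind of check involved, with hypothetical names throughout and
assuming 4 KiB pages for the pfn arithmetic:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

struct list_head {
	struct list_head *next, *prev;
};

/* Physical address of the first byte of a page frame. */
static uint64_t pfn_to_phys(uint64_t pfn)
{
	return pfn << SKETCH_PAGE_SHIFT;
}

/*
 * Hypothetical version of the check: returns 1 and fills buf with a
 * message if the list head was never initialized, 0 otherwise.
 */
static int report_null_head(const struct list_head *head, uint64_t pfn,
			    char *buf, size_t len)
{
	assert(head != NULL && buf != NULL);
	if (head->next != NULL)
		return 0;
	snprintf(buf, len, "NULL anon_vma->head.next, page %llu",
		 (unsigned long long)pfn);
	return 1;
}
```

Under the 4 KiB assumption, pfn 2182681 is 0x214e19, so the page starts at
physical address 0x214e19000, a little above the 8 GiB mark.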

Now, how do we track back to the place which is missing anon_vma->head
init? Can we use the struct page *page arg to page_referenced_anon()
somehow?

[  873.254654] Pid: 3642, comm: hib.sh Not tainted 2.6.34-rc3-00288-gab195c5-dirty #3
[  873.254904] Call Trace:
[  873.255063]  [<ffffffff810c0c28>] page_referenced+0xd3/0x219
[  873.255212]  [<ffffffff810c5fb0>] ? swapcache_free+0x37/0x3c
[  873.255364]  [<ffffffff810ab782>] shrink_page_list+0x14a/0x477
[  873.255512]  [<ffffffff810aa6e0>] ? isolate_pages_global+0xc4/0x1f0
[  873.255662]  [<ffffffff813f8a76>] ? _raw_spin_unlock_irq+0x30/0x58
[  873.255811]  [<ffffffff810abe06>] shrink_inactive_list+0x357/0x5e5
[  873.255960]  [<ffffffff810ab626>] ? shrink_active_list+0x232/0x244
[  873.256112]  [<ffffffff810ac39e>] shrink_zone+0x30a/0x3d4
[  873.256264]  [<ffffffff810acf79>] do_try_to_free_pages+0x176/0x27f
[  873.256416]  [<ffffffff810ad117>] shrink_all_memory+0x95/0xc4
[  873.256564]  [<ffffffff810aa61c>] ? isolate_pages_global+0x0/0x1f0
[  873.256713]  [<ffffffff81076e4c>] ? count_data_pages+0x65/0x79
[  873.256862]  [<ffffffff810770b3>] hibernate_preallocate_memory+0x1aa/0x2cb
[  873.257036]  [<ffffffff813f4f75>] ? printk+0x41/0x44
[  873.257186]  [<ffffffff81075a53>] hibernation_snapshot+0x36/0x1e1
[  873.257337]  [<ffffffff81075ccc>] hibernate+0xce/0x172
[  873.257485]  [<ffffffff81074a39>] state_store+0x5c/0xd3
[  873.257634]  [<ffffffff81184eff>] kobj_attr_store+0x17/0x19
[  873.257783]  [<ffffffff81125d43>] sysfs_write_file+0x108/0x144
[  873.257932]  [<ffffffff810d560f>] vfs_write+0xb2/0x153
[  873.258084]  [<ffffffff81063bd9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[  873.258237]  [<ffffffff810d5773>] sys_write+0x4a/0x71
[  873.258388]  [<ffffffff810021db>] system_call_fastpath+0x16/0x1b


>  - Related to the above: perhaps the RCU freeing isn't working, or 
> >    slub/slab/slob ends up reusing the allocations for something other than 
> >    anon_vmas, so together with the race _and_ an unlucky re-use, you get 
>    some odd crud.
> 
>    I haven't looked at the kernel config files: do they perhaps share the 
>    same (odd?) SLUB/SLAB/SLOB config?

What is an odd SL[AOU]B config?

> >  - anon_vma isn't actually an anon_vma at all. 'page->mapping' was crud 
> >    with the low bit set. That sounds unlikely, but who knows. The ksm code 
> >    sets mapping to "stable_node + (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM)"
> 
>    Did people have KSM enabled?

Nope, KSM is off here.
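
For readers unfamiliar with the low-bit tagging of page->mapping mentioned
above, a user-space sketch of the idea. The bit values (1 for anon, 2 for
KSM) are assumptions made for this example, and all names are made up for
illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed tag bits, kept in the low bits of an aligned pointer. */
#define SK_MAPPING_ANON		((uintptr_t)1)
#define SK_MAPPING_KSM		((uintptr_t)2)
#define SK_MAPPING_FLAGS	(SK_MAPPING_ANON | SK_MAPPING_KSM)

/* Low bit set: the pointer is an anon_vma-style object, not a file mapping. */
static int mapping_is_anon(uintptr_t mapping)
{
	return (mapping & SK_MAPPING_ANON) != 0;
}

/* Both bits set: the pointer is really a KSM stable node. */
static int mapping_is_ksm(uintptr_t mapping)
{
	return (mapping & SK_MAPPING_FLAGS) == SK_MAPPING_FLAGS;
}

/* Tag an aligned pointer as KSM, as in the expression quoted above. */
static uintptr_t mapping_tag_ksm(uintptr_t stable_node)
{
	/* the tag bits must be free, i.e. the pointer is at least 4-byte aligned */
	assert((stable_node & SK_MAPPING_FLAGS) == 0);
	return stable_node + SK_MAPPING_FLAGS;
}

/* Recover the untagged pointer. */
static uintptr_t mapping_strip(uintptr_t mapping)
{
	return mapping & ~SK_MAPPING_FLAGS;
}
```

The point of the third case is that a check like mapping_is_anon() succeeds
for any value with the low bit set, so garbage or a KSM-tagged pointer
would be stripped and then treated as an anon_vma.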

-- 
Regards/Gruss,
Boris.

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 20:46                               ` Steinar H. Gunderson
@ 2010-04-06 20:56                                 ` Linus Torvalds
  2010-04-06 21:05                                   ` Steinar H. Gunderson
  0 siblings, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-06 20:56 UTC (permalink / raw)
  To: Steinar H. Gunderson
  Cc: Borislav Petkov, Andrew Morton, Rik van Riel, Minchan Kim,
	KOSAKI Motohiro, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins



On Tue, 6 Apr 2010, Steinar H. Gunderson wrote:

> On Tue, Apr 06, 2010 at 01:02:35PM -0700, Linus Torvalds wrote:
> >    I haven't looked at the kernel config files: do they perhaps share the 
> >    same (odd?) SLUB/SLAB/SLOB config?
> 
> http://storage.sesse.net/config-crashing-2.6.34-rc2

Ok, CONFIG_SLUB, which is the common case. Not likely to be buggy.

> >    Did people have KSM enabled?
> 
> No KSM for me.

Ok, not anything odd there either, and you're not using any odd RCU setup 
either. Nothing odd at all strikes me about your config, in fact. Lots and 
lots of modules, but I guess it comes from some distro default config..

		Linus

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 20:56                                 ` Linus Torvalds
@ 2010-04-06 21:05                                   ` Steinar H. Gunderson
  0 siblings, 0 replies; 242+ messages in thread
From: Steinar H. Gunderson @ 2010-04-06 21:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Andrew Morton, Rik van Riel, Minchan Kim,
	KOSAKI Motohiro, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins

On Tue, Apr 06, 2010 at 01:56:19PM -0700, Linus Torvalds wrote:
>>>    Did people have KSM enabled?
>> No KSM for me.
> Ok, not anything odd there either, and you're not using any odd RCU setup 
> either. Nothing odd at all strikes me about your config, in fact. Lots and 
> lots of modules, but I guess it comes from some distro default config..

I think it was originally some distro config, yes, but that «config fork» was
at 2.6.16 or something...

/* Steinar */
-- 
Homepage: http://www.sesse.net/

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 20:51                               ` Borislav Petkov
@ 2010-04-06 21:27                                 ` Linus Torvalds
  2010-04-06 22:59                                   ` Borislav Petkov
  2010-04-06 23:22                                   ` Rik van Riel
  0 siblings, 2 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-06 21:27 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson



On Tue, 6 Apr 2010, Borislav Petkov wrote:
> > So again, it's actually anon_vma.head.next that is NULL, not any of the 
> > entries on the list itself.
> > 
> > Now, I can see several cases for this:
> > 
> >  - the obvious one: anon_vma just wasn't correctly initialized, and is 
> >    missing a INIT_LIST_HEAD(&anon_vma->head). That's either a slab bug (we 
> >    don't have a whole lot of coverage of constructors), or somebody 
> >    allocated an anon_vma without using the anon_vma_cachep.
> 
> I've added code to verify this and am suspend/resuming now... Wait a
> minute, Linus, you're good! :) :
> 
> [  873.083074] PM: Preallocating image memory... 
> [  873.254359] NULL anon_vma->head.next, page 2182681

Yeah, I was pretty sure of that thing.

I still don't see _how_ it happens, though. That 'struct anon_vma' is very 
simple, and contains literally just the lock and that list_head.

Now, 'head.next' is kind of magical, because it contains that magic 
low-bit "have I been locked" thing (see "vm_lock_anon_vma()" in 
mm/mmap.c). But I'm not seeing anything else touching it.

And if you allocate an anon_vma the proper way, the SLUB constructor should 
have made sure that the head is initialized. And no normal list operation 
ever sets any list pointer to zero, although a "list_del()" on the first 
list entry could do it if that first list entry had a NULL next pointer. 
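
To make that invariant concrete, here is a minimal userspace sketch. The 
helpers are simplified stand-ins written for illustration, not the kernel's 
<linux/list.h> (the real list_del() also poisons the removed entry), and 
'struct fake_anon_vma' and the two check functions are hypothetical names. 
It shows that an initialized head points at itself, that add and delete 
keep head.next non-NULL, and that a head which never went through 
INIT_LIST_HEAD() (zeroed here) has a NULL next:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static int list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* Insert 'entry' right after 'head'. */
static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Unlink only; no poisoning in this sketch. */
static void list_del(struct list_head *entry)
{
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
}

/* Hypothetical stand-in: only the list head matters here. */
struct fake_anon_vma {
	struct list_head head;
};

/* Returns 1 if head.next stays non-NULL through init, add and delete. */
static int initialized_head_stays_non_null(void)
{
	struct fake_anon_vma av;
	struct list_head entry;
	int ok;

	INIT_LIST_HEAD(&av.head);
	ok = list_empty(&av.head) && av.head.next == &av.head;

	list_add(&entry, &av.head);
	ok = ok && av.head.next == &entry && !list_empty(&av.head);

	list_del(&entry);
	ok = ok && av.head.next == &av.head && list_empty(&av.head);

	return ok;
}

/* Returns 1 if a never-initialized (zeroed) head has a NULL next. */
static int zeroed_head_has_null_next(void)
{
	struct fake_anon_vma av;

	memset(&av, 0, sizeof(av));
	return av.head.next == NULL && !list_empty(&av.head);
}
```

Note that list_empty() is false for the zeroed head, so a list walk would 
not be skipped and its first step would start from the NULL pointer.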

> Now, how do we track back to the place which is missing anon_vma->head
> init? Can we use the struct page *page arg to page_referenced_anon()
> somehow?

You might enable SLUB debugging (both SLUB_DEBUG _and_ SLUB_DEBUG_ON), and 
then make the "object_err()" function in mm/slub.c be non-static. You 
could call it when you see the problem, perhaps.

Or you could just add tests to both alloc_anon_vma() and free_anon_vma() 
to check that 'list_empty(&anon_vma->head)' is true. I dunno.

> >    I haven't looked at the kernel config files: do they perhaps share the 
> >    same (odd?) SLUB/SLAB/SLOB config?
> 
> what is an odd SL[AOU]B config?

Probably anything but the default SLUB these days.  But Steinar already 
said he had SLUB, so it's unlikely to be something odd.

> >  - anon_vma isn't actually an anonvma at all. 'page->mapping' was crud 
> >    with the low bit set. That sounds unlikely, but who knows. The ksm code 
> >    sets mapping to "stable_node + PAGE_MAPPING_ANON | PAGE_MAPPING_KSM"
> > 
> >    Did people have KSM enabled?
> 
> Nope, KSM is off here.

Yeah, wasn't for Steinar either. So it doesn't look like it's any odd 
corner case that depends on some odd configuration.

		Linus

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 21:27                                 ` Linus Torvalds
@ 2010-04-06 22:59                                   ` Borislav Petkov
  2010-04-06 23:27                                     ` Linus Torvalds
  2010-04-06 23:37                                     ` Linus Torvalds
  2010-04-06 23:22                                   ` Rik van Riel
  1 sibling, 2 replies; 242+ messages in thread
From: Borislav Petkov @ 2010-04-06 22:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, Apr 06, 2010 at 02:27:37PM -0700

> On Tue, 6 Apr 2010, Borislav Petkov wrote:
> > > So again, it's actually anon_vma.head.next that is NULL, not any of the 
> > > entries on the list itself.
> > > 
> > > Now, I can see several cases for this:
> > > 
> > >  - the obvious one: anon_vma just wasn't correctly initialized, and is 
> > >    missing a INIT_LIST_HEAD(&anon_vma->head). That's either a slab bug (we 
> > >    don't have a whole lot of coverage of constructors), or somebody 
> > >    allocated an anon_vma without using the anon_vma_cachep.
> > 
> > I've added code to verify this and am suspend/resuming now... Wait a
> > minute, Linus, you're good! :) :
> > 
> > [  873.083074] PM: Preallocating image memory... 
> > [  873.254359] NULL anon_vma->head.next, page 2182681
> 
> Yeah, I was pretty sure of that thing.
> 
> I still don't see _how_ it happens, though. That 'struct anon_vma' is very 
> simple, and contains literally just the lock and that list_head.
> 
> Now, 'head.next' is kind of magical, because it contains that magic 
> low-bit "have I been locked" thing (see "vm_lock_anon_vma()" in 
> mm/mmap.c). But I'm not seeing anything else touching it.
> 
> And if you allocate an anon_vma the proper way, the SLUB constructor should 
> have made sure that the head is initialized. And no normal list operation 
> ever sets any list pointer to zero, although a "list_del()" on the first 
> list entry could do it if that first list entry had a NULL next pointer. 
> 
> > Now, how do we track back to the place which is missing anon_vma->head
> > init? Can we use the struct page *page arg to page_referenced_anon()
> > somehow?
> 
> You might enable SLUB debugging (both SLUB_DEBUG _and_ SLUB_DEBUG_ON), and 
> then make the "object_err()" function in mm/slub.c be non-static. You 
> could call it when you see the problem, perhaps.
> 
> Or you could just add tests to both alloc_anon_vma() and free_anon_vma() 
> to check that 'list_empty(&anon_vma->head)' is true. I dunno.

Ok, I tried doing all you suggested and here's what came out. Please,
take this with a grain of salt because I'm almost falling asleep - even
the coffee is not working anymore so it could be just as well that I've
made a mistake somewhere (the new OOPS is a #GP, by the way), just
watch:

Source changes locally:

--
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 4884462..0c11dfb 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -108,6 +108,8 @@ unsigned int kmem_cache_size(struct kmem_cache *);
 const char *kmem_cache_name(struct kmem_cache *);
 int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
 
+void object_err(struct kmem_cache *s, struct page *page, u8 *object, char *reason);
+
 /*
  * Please use this macro to create slab caches. Simply specify the
  * name of the structure and maybe some flags that are listed above.
diff --git a/mm/rmap.c b/mm/rmap.c
index eaa7a09..7b35b3f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -66,11 +66,24 @@ static struct kmem_cache *anon_vma_chain_cachep;
 
 static inline struct anon_vma *anon_vma_alloc(void)
 {
-	return kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
+	struct anon_vma *ret;
+	ret = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
+
+	if (!ret->head.next) {
+		printk("%s NULL anon_vma->head.next\n", __func__);
+		dump_stack();
+	}
+
+	return ret;
 }
 
 void anon_vma_free(struct anon_vma *anon_vma)
 {
+	if (!anon_vma->head.next) {
+		printk("%s NULL anon_vma->head.next\n", __func__);
+		dump_stack();
+	}
+
 	kmem_cache_free(anon_vma_cachep, anon_vma);
 }
 
@@ -494,6 +507,18 @@ static int page_referenced_anon(struct page *page,
 		return referenced;
 
 	mapcount = page_mapcount(page);
+
+	if (!anon_vma->head.next) {
+		printk(KERN_ERR "NULL anon_vma->head.next, page %lu\n",
+				page_to_pfn(page));
+
+		object_err(anon_vma_cachep, page, (u8 *)anon_vma, "NULL next");
+
+		dump_stack();
+
+		return referenced;
+	}
+
 	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
 		struct vm_area_struct *vma = avc->vma;
 		unsigned long address = vma_address(page, vma);
diff --git a/mm/slub.c b/mm/slub.c
index b364844..bcf5416 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -477,7 +477,7 @@ static void print_trailer(struct kmem_cache *s, struct page *page, u8 *p)
 	dump_stack();
 }
 
-static void object_err(struct kmem_cache *s, struct page *page,
+void object_err(struct kmem_cache *s, struct page *page,
 			u8 *object, char *reason)
 {
 	slab_bug(s, "%s", reason);

---

do the same exercise of starting several guests and then shutting them
down, and hibernating at the same time. After having shutdown the
guests, start firefox and let it load a big html page and hibernate
while doing so, boom!

[  269.104940] Freezing user space processes ... (elapsed 0.03 seconds) done.
[  269.141953] Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.
[  269.155115] PM: Preallocating image memory... 
[  269.423811] general protection fault: 0000 [#1] PREEMPT SMP 
[  269.424003] last sysfs file: /sys/power/state
[  269.424003] CPU 0 
[  269.424003] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd pcspkr edac_core k10temp 8250_pnp 8250 serial_core
[  269.424003] 
[  269.424003] Pid: 2617, comm: hib.sh Tainted: G        W  2.6.34-rc3-00288-gab195c5-dirty #4 M3A78 PRO/System Product Name
[  269.424003] RIP: 0010:[<ffffffff810c0cb4>]  [<ffffffff810c0cb4>] page_referenced+0x147/0x232
[  269.424003] RSP: 0018:ffff88022a1218b8  EFLAGS: 00010246
[  269.424003] RAX: ffff8802126fa468 RBX: ffffea000700b210 RCX: 0000000000000000
[  269.424003] RDX: ffff8802126fa429 RSI: ffff8802126fa440 RDI: ffff88022dc3cb80
[  269.424003] RBP: ffff88022a121938 R08: 0000000000000002 R09: 0000000000000000
[  269.424003] R10: 0000000000000246 R11: ffff88021a030478 R12: 0000000000000000
[  269.424003] R13: 002e2e2e002e2e0e R14: ffff8802126fa428 R15: ffff88022a121a00
[  269.424003] FS:  00007fe2799796f0(0000) GS:ffff88000a000000(0000) knlGS:0000000000000000
[  269.424003] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  269.424003] CR2: 00007fffdefb3880 CR3: 00000002171c0000 CR4: 00000000000006f0
[  269.424003] DR0: 0000000000000090 DR1: 00000000000000a4 DR2: 00000000000000ff
[  269.424003] DR3: 000000000000000f DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  269.424003] Process hib.sh (pid: 2617, threadinfo ffff88022a120000, task ffff88022dc3cb80)
[  269.424003] Stack:
[  269.424003]  ffff8802126fa468 00000000813f8cfc ffffffff8165ae28 00000000000042e7
[  269.424003] <0> ffff88022a1218f8 ffffffff810c6051 ffffea0006f968c8 ffffea0006f968c8
[  269.424003] <0> ffff88022a121938 00000002810ab275 0000000006f96890 ffffea000700b238
[  269.424003] Call Trace:
[  269.424003]  [<ffffffff810c6051>] ? swapcache_free+0x37/0x3c
[  269.424003]  [<ffffffff810ab79a>] shrink_page_list+0x14a/0x477
[  269.424003]  [<ffffffff813f8c36>] ? _raw_spin_unlock_irq+0x30/0x58
[  269.424003]  [<ffffffff810abe1e>] shrink_inactive_list+0x357/0x5e5
[  269.424003]  [<ffffffff810ac3b6>] shrink_zone+0x30a/0x3d4
[  269.424003]  [<ffffffff810acf91>] do_try_to_free_pages+0x176/0x27f
[  269.424003]  [<ffffffff810ad12f>] shrink_all_memory+0x95/0xc4
[  269.424003]  [<ffffffff810aa634>] ? isolate_pages_global+0x0/0x1f0
[  269.424003]  [<ffffffff81076e64>] ? count_data_pages+0x65/0x79
[  269.424003]  [<ffffffff810770cb>] hibernate_preallocate_memory+0x1aa/0x2cb
[  269.424003]  [<ffffffff813f5135>] ? printk+0x41/0x44
[  269.424003]  [<ffffffff81075a6b>] hibernation_snapshot+0x36/0x1e1
[  269.424003]  [<ffffffff81075ce4>] hibernate+0xce/0x172
[  269.424003]  [<ffffffff81074a51>] state_store+0x5c/0xd3
[  269.424003]  [<ffffffff81185097>] kobj_attr_store+0x17/0x19
[  269.424003]  [<ffffffff81125edb>] sysfs_write_file+0x108/0x144
[  269.424003]  [<ffffffff810d57a7>] vfs_write+0xb2/0x153
[  269.424003]  [<ffffffff81063bf1>] ? trace_hardirqs_on_caller+0x1f/0x14b
[  269.424003]  [<ffffffff810d590b>] sys_write+0x4a/0x71
[  269.424003]  [<ffffffff810021db>] system_call_fastpath+0x16/0x1b
[  269.424003] Code: 3b 56 10 73 1e 48 83 fa f2 74 18 48 8d 4d cc 4d 89 f8 48 89 df e8 1e f2 ff ff 41 01 c4 83 7d cc 00 74 19 4d 8b 6d 20 49 83 ed 20 <49> 8b 45 20 0f 18 08 49 8d 45 20 48 39 45 80 75 aa 4c 89 f7 e8 
[  269.424003] RIP  [<ffffffff810c0cb4>] page_referenced+0x147/0x232
[  269.424003]  RSP <ffff88022a1218b8>
[  269.438405] ---[ end trace ad5b4172ee94398e ]---
[  269.438553] note: hib.sh[2617] exited with preempt_count 2
[  269.438709] BUG: scheduling while atomic: hib.sh/2617/0x10000003
[  269.438858] INFO: lockdep is turned off.
[  269.439075] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd pcspkr edac_core k10temp 8250_pnp 8250 serial_core
[  269.440875] Pid: 2617, comm: hib.sh Tainted: G      D W  2.6.34-rc3-00288-gab195c5-dirty #4
[  269.441137] Call Trace:
[  269.441288]  [<ffffffff81063107>] ? __debug_show_held_locks+0x1b/0x24
[  269.441440]  [<ffffffff8102d3c0>] __schedule_bug+0x72/0x77
[  269.441590]  [<ffffffff813f553e>] schedule+0xd9/0x730
[  269.441741]  [<ffffffff8103022c>] __cond_resched+0x18/0x24
[  269.441891]  [<ffffffff813f5c62>] _cond_resched+0x2c/0x37
[  269.442045]  [<ffffffff810b7d7d>] unmap_vmas+0x6ce/0x893
[  269.442205]  [<ffffffff810bc42f>] exit_mmap+0xd7/0x182
[  269.442352]  [<ffffffff81035951>] mmput+0x48/0xb9
[  269.442502]  [<ffffffff81039c21>] exit_mm+0x110/0x11d
[  269.442652]  [<ffffffff8103b663>] do_exit+0x1c5/0x691
[  269.442802]  [<ffffffff81038d0d>] ? kmsg_dump+0x13b/0x155
[  269.442953]  [<ffffffff810060db>] ? oops_end+0x47/0x93
[  269.443107]  [<ffffffff81006122>] oops_end+0x8e/0x93
[  269.443262]  [<ffffffff81006313>] die+0x5a/0x63
[  269.443414]  [<ffffffff81003eaf>] do_general_protection+0x134/0x13c
[  269.443566]  [<ffffffff813f90f0>] ? irq_return+0x0/0x2
[  269.443716]  [<ffffffff813f92cf>] general_protection+0x1f/0x30
[  269.443867]  [<ffffffff810c0cb4>] ? page_referenced+0x147/0x232
[  269.444021]  [<ffffffff810c0bf0>] ? page_referenced+0x83/0x232
[  269.444176]  [<ffffffff810c6051>] ? swapcache_free+0x37/0x3c
[  269.444328]  [<ffffffff810ab79a>] shrink_page_list+0x14a/0x477
[  269.444479]  [<ffffffff813f8c36>] ? _raw_spin_unlock_irq+0x30/0x58
[  269.444630]  [<ffffffff810abe1e>] shrink_inactive_list+0x357/0x5e5
[  269.444782]  [<ffffffff810ac3b6>] shrink_zone+0x30a/0x3d4
[  269.444933]  [<ffffffff810acf91>] do_try_to_free_pages+0x176/0x27f
[  269.445087]  [<ffffffff810ad12f>] shrink_all_memory+0x95/0xc4
[  269.445243]  [<ffffffff810aa634>] ? isolate_pages_global+0x0/0x1f0
[  269.445396]  [<ffffffff81076e64>] ? count_data_pages+0x65/0x79
[  269.445547]  [<ffffffff810770cb>] hibernate_preallocate_memory+0x1aa/0x2cb
[  269.445698]  [<ffffffff813f5135>] ? printk+0x41/0x44
[  269.445848]  [<ffffffff81075a6b>] hibernation_snapshot+0x36/0x1e1
[  269.445999]  [<ffffffff81075ce4>] hibernate+0xce/0x172
[  269.446160]  [<ffffffff81074a51>] state_store+0x5c/0xd3
[  269.446307]  [<ffffffff81185097>] kobj_attr_store+0x17/0x19
[  269.446457]  [<ffffffff81125edb>] sysfs_write_file+0x108/0x144
[  269.446607]  [<ffffffff810d57a7>] vfs_write+0xb2/0x153
[  269.446757]  [<ffffffff81063bf1>] ? trace_hardirqs_on_caller+0x1f/0x14b
[  269.446908]  [<ffffffff810d590b>] sys_write+0x4a/0x71
[  269.447063]  [<ffffffff810021db>] system_call_fastpath+0x16/0x1b


This time we have

[  269.424003] RIP: 0010:[<ffffffff810c0cb4>]  [<ffffffff810c0cb4>] page_referenced+0x147/0x232 

which is offset 0x1104.

which is

    10eb:       48 89 df                mov    %rbx,%rdi
    10ee:       e8 00 00 00 00          callq  10f3 <page_referenced+0x136>
    10f3:       41 01 c4                add    %eax,%r12d
    10f6:       83 7d cc 00             cmpl   $0x0,-0x34(%rbp)
    10fa:       74 19                   je     1115 <page_referenced+0x158>
    10fc:       4d 8b 6d 20             mov    0x20(%r13),%r13
    1100:       49 83 ed 20             sub    $0x20,%r13
    1104:       49 8b 45 20             mov    0x20(%r13),%rax			<-------------------------
    1108:       0f 18 08                prefetcht0 (%rax)
    110b:       49 8d 45 20             lea    0x20(%r13),%rax
    110f:       48 39 45 80             cmp    %rax,-0x80(%rbp)
    1113:       75 aa                   jne    10bf <page_referenced+0x102>
    1115:       4c 89 f7                mov    %r14,%rdi

and asm is

	.loc 1 522 0
	movq	32(%r13), %r13	# <variable>.same_anon_vma.next, __mptr.454
.LVL295:
	subq	$32, %r13	#, avc
.LVL296:
.L186:
.LBE1224:
	movq	32(%r13), %rax	# <variable>.same_anon_vma.next, <variable>.same_anon_vma.next		<--------------
	prefetcht0	(%rax)	# <variable>.same_anon_vma.next
	leaq	32(%r13), %rax	#, tmp104
	cmpq	%rax, -128(%rbp)	# tmp104, %sfp
	jne	.L189	#,
.L188:
	.loc 1 540 0
	movq	%r14, %rdi	# anon_vma,
	call	page_unlock_anon_vma	#

and %r13 contains some funny stuff, could be some mangled SLUB debug
poison or something: R13: 002e2e2e002e2e0e. Maybe this is the reason for
the #GP.

But yes, even if the oopsing instruction is

movq	32(%r13), %rax	# <variable>.same_anon_vma.next, <variable>.same_anon_vma.next

this is not same_anon_vma.next because we've come to the above
instruction through the ".L186:" label, before which we have %r13
already loaded with anon_vma->head.next.

To be continued...

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply related	[flat|nested] 242+ messages in thread

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 21:27                                 ` Linus Torvalds
  2010-04-06 22:59                                   ` Borislav Petkov
@ 2010-04-06 23:22                                   ` Rik van Riel
  2010-04-07  0:10                                     ` Linus Torvalds
  1 sibling, 1 reply; 242+ messages in thread
From: Rik van Riel @ 2010-04-06 23:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Andrew Morton, Minchan Kim, KOSAKI Motohiro,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson

On 04/06/2010 05:27 PM, Linus Torvalds wrote:

> I still don't see _how_ it happens, though. That 'struct anon_vma' is very
> simple, and contains literally just the lock and that list_head.

It gets more fun.  It looks like the anon_vma is only
allocated through anon_vma_alloc() and only handled
by the functions in rmap.c

By themselves, all of those functions look alright.

However, I think I may have found a possible bug in
the interplay between anon_vma_prepare() and vma_adjust(),
across several mprotect invocations.

Let me explain what I think may be going on in small
steps, since it is quite subtle (assuming I am right).

1) a process forks, creating a second "layer" of
    anon_vma objects for the VMAs that have anon pages

2) a new VMA is created adjacent to an existing one,
    with different permissions

3) anon_vma_prepare is called on the new VMA, this
    only links the "top" anon_vma to the new VMA, since
    that is the anon_vma where all new pages get
    instantiated anyway   (this would be part of the bug)

4) mprotect changes the permission of one of the VMAs,
    causing the old and the new VMAs to get merged

5) vma_adjust calls anon_vma_merge, causing the anon_vma
    chain of one of the VMAs to get nuked - with bad luck,
    this is the original one, leaving just the new anon_vma
    attached to the VMA

6) if the parent process quits, the old anon_vma structs
    get freed

7) meanwhile, we may still have some anonymous pages
    stick around in memory that have their page->mapping
    point to a freed anon_vma struct

Does this look like it could happen?

If so, I'll cook up a patch to change anon_vma_prepare
and find_mergeable_anon_vma to attach the whole chain
of anon_vmas to the new VMA, using anon_vma_clone().

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 22:59                                   ` Borislav Petkov
@ 2010-04-06 23:27                                     ` Linus Torvalds
  2010-04-06 23:54                                       ` [PATCH] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA Rik van Riel
                                                         ` (2 more replies)
  2010-04-06 23:37                                     ` Linus Torvalds
  1 sibling, 3 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-06 23:27 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson



On Wed, 7 Apr 2010, Borislav Petkov wrote:
> 
> Ok, I tried doing all you suggested and here's what came out. Please,
> take this with a grain of salt because I'm almost falling asleep - even
> the coffee is not working anymore so it could be just as well that I've
> made a mistake somewhere (the new OOPS is a #GP, by the way), just
> watch:

Hey ho, yeah.

The reason it's a #GP fault is that it's not a NULL pointer dereference 
any more, but a wild pointer that is not in the legal region of pointers 
on x86-64. That is also why your debugging code didn't catch it: the 
pointer isn't NULL, so you got the #GP fault on the same old instruction:

	  2b:*	49 8b 45 20          	mov    0x20(%r13),%rax     <-- trapping instruction

for all the same old reasons.

But now %r13 has a non-zero value: 0x002e2e2e002e2e0e, which I do _not_ 
recognize as any of the normal poison values.

> and %r13 contains some funny stuff, could be some mangled SLUB debug
> poison or something: R13: 002e2e2e002e2e0e. Maybe this is the reason for
> the #GP.

Correct. You don't get a page fault if the pointer was totally bogus: a 
non-canonical address raises #GP instead.

> But yes, even if the oopsing instruction is
> 
> movq	32(%r13), %rax	# <variable>.same_anon_vma.next, <variable>.same_anon_vma.next
> 
> this is not same_anon_vma.next because we've come to the above
> instruction through the ".L186:" label, before which we have %r13
> already loaded with anon_vma->head.next.

No, you're mis-reading the asm. It's again the first iteration, and the 
code above it is again the end of the loop. And %rax is once more a kernel 
pointer, not the return value of 'page_referenced_one()'. 

So it once more is 'anon_vma->head.next' that is crap, but now it's not 
NULL, it's that very odd 0x002e2e2e002e2e2e pattern (the %r13 has had 0x20 
subtracted from it, so that LSB of "0x0e" is actually _also_ a 0x2e).

What does '0x2e' mean? It's ASCII '.', but that doesn't really mean 
anything either.
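
The arithmetic is easy to check in userspace. A minimal sketch, assuming 
48-bit virtual addresses (the usual x86-64 layout, where bits 63..47 must 
all be equal) and taking the 0x20 offset of 'same_anon_vma' from the 
"sub $0x20,%r13" in the disassembly; the function names are made up for 
illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * With 48-bit virtual addresses, an address is canonical only if
 * bits 63..47 are all zeroes or all ones. Touching a non-canonical
 * address raises #GP rather than a page fault.
 */
static bool is_canonical_48(uint64_t addr)
{
	uint64_t top = addr >> 47;	/* the 17 uppermost bits */

	return top == 0 || top == 0x1ffff;
}

/*
 * What list_for_each_entry() does with a list pointer: step back from
 * the embedded list_head to the start of the containing structure.
 */
static uint64_t entry_from_node(uint64_t node, uint64_t member_offset)
{
	return node - member_offset;
}
```

Feeding in 0x002e2e2e002e2e2e with an offset of 0x20 gives the 
0x002e2e2e002e2e0e seen in %r13, and that value fails the canonical check, 
whereas a real kernel pointer such as the ffff8802126fa468 in %rax passes 
it, and so does NULL (which is why the earlier oops was an ordinary page 
fault).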

			Linus

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 22:59                                   ` Borislav Petkov
  2010-04-06 23:27                                     ` Linus Torvalds
@ 2010-04-06 23:37                                     ` Linus Torvalds
  1 sibling, 0 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-06 23:37 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson



On Wed, 7 Apr 2010, Borislav Petkov wrote:
> +
> +	if (!anon_vma->head.next) {
> +		printk(KERN_ERR "NULL anon_vma->head.next, page %lu\n",
> +				page_to_pfn(page));
> +
> +		object_err(anon_vma_cachep, page, (u8 *)anon_vma, "NULL next");

Oh, and since the debugging code never triggered ('head.next' wasn't 
actually NULL), you never got here, but the 'page' you passed in to 
object_err() should be the page of the slab allocation, not the page 
associated with the anon_vma.

So it should be something like "virt_to_head_page(anon_vma)" that you pass 
in to object_err().

Not that it matters. I assume it is the fact that SLUB debugging is on 
that actually turns the NULL into a non-NULL thing. Poisoning is not 
active for SLUB caches with constructors or RCU-freeing, but things like 
redzoning still are. So enabling SLUB debugging will change the offsets 
within the pages of all the SLUB allocations.  I wonder if that's just 
what caused it to now have that 0x002e2e2e002e2e2e instead of NULL.

		Linus

^ permalink raw reply	[flat|nested] 242+ messages in thread

* [PATCH] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-06 23:27                                     ` Linus Torvalds
@ 2010-04-06 23:54                                       ` Rik van Riel
  2010-04-07  7:00                                         ` KOSAKI Motohiro
  2010-04-07  7:29                                       ` Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) Borislav Petkov
  2010-04-07 14:05                                       ` Paulo Marques
  2 siblings, 1 reply; 242+ messages in thread
From: Rik van Riel @ 2010-04-06 23:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Andrew Morton, Minchan Kim, KOSAKI Motohiro,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

When a new VMA has a mergeable anon_vma with a neighboring VMA,
make sure all of the neighbor's old anon_vma structs are also
linked in.

This is necessary because at some point the VMAs could get merged,
and we want to ensure no anon_vma structs get freed prematurely,
while the system still has anonymous pages that belong to those
structs.

Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Rik van Riel <riel@redhat.com>

--- 
 include/linux/mm.h |    2 +-
 mm/mmap.c          |    6 +++---
 mm/rmap.c          |   20 +++++++++++++-------
 3 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e70f21b..90ac50e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1228,7 +1228,7 @@ extern struct vm_area_struct *vma_merge(struct mm_struct *,
 	struct vm_area_struct *prev, unsigned long addr, unsigned long end,
 	unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
 	struct mempolicy *);
-extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
+extern struct vm_area_struct *find_mergeable_anon_vma(struct vm_area_struct *);
 extern int split_vma(struct mm_struct *,
 	struct vm_area_struct *, unsigned long addr, int new_below);
 extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
diff --git a/mm/mmap.c b/mm/mmap.c
index 75557c6..bf0600c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -832,7 +832,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
  * anon_vmas being allocated, preventing vma merge in subsequent
  * mprotect.
  */
-struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
+struct vm_area_struct *find_mergeable_anon_vma(struct vm_area_struct *vma)
 {
 	struct vm_area_struct *near;
 	unsigned long vm_flags;
@@ -855,7 +855,7 @@ struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
 			can_vma_merge_before(near, vm_flags,
 				NULL, vma->vm_file, vma->vm_pgoff +
 				((vma->vm_end - vma->vm_start) >> PAGE_SHIFT)))
-		return near->anon_vma;
+		return near;
 try_prev:
 	/*
 	 * It is potentially slow to have to call find_vma_prev here.
@@ -875,7 +875,7 @@ try_prev:
   			mpol_equal(vma_policy(near), vma_policy(vma)) &&
 			can_vma_merge_after(near, vm_flags,
 				NULL, vma->vm_file, vma->vm_pgoff))
-		return near->anon_vma;
+		return near;
 none:
 	/*
 	 * There's no absolute need to look only at touching neighbours:
diff --git a/mm/rmap.c b/mm/rmap.c
index eaa7a09..60616db 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -119,20 +119,26 @@ int anon_vma_prepare(struct vm_area_struct *vma)
 	might_sleep();
 	if (unlikely(!anon_vma)) {
 		struct mm_struct *mm = vma->vm_mm;
+		struct vm_area_struct *merge_vma;
 		struct anon_vma *allocated;
 
+		merge_vma = find_mergeable_anon_vma(vma);
+		if (merge_vma) {
+			if (anon_vma_clone(vma, merge_vma))
+				goto out_enomem;
+			return 0;
+		}
+
 		avc = anon_vma_chain_alloc();
 		if (!avc)
 			goto out_enomem;
 
-		anon_vma = find_mergeable_anon_vma(vma);
 		allocated = NULL;
-		if (!anon_vma) {
-			anon_vma = anon_vma_alloc();
-			if (unlikely(!anon_vma))
-				goto out_enomem_free_avc;
-			allocated = anon_vma;
-		}
+		anon_vma = anon_vma_alloc();
+		if (unlikely(!anon_vma))
+			goto out_enomem_free_avc;
+		allocated = anon_vma;
+
 		spin_lock(&anon_vma->lock);
 
 		/* page_table_lock to protect against threads */

^ permalink raw reply related	[flat|nested] 242+ messages in thread

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 23:22                                   ` Rik van Riel
@ 2010-04-07  0:10                                     ` Linus Torvalds
  2010-04-07  1:18                                       ` Rik van Riel
  2010-04-07 10:09                                       ` Pekka Enberg
  0 siblings, 2 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-07  0:10 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Borislav Petkov, Andrew Morton, Minchan Kim, KOSAKI Motohiro,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson



On Tue, 6 Apr 2010, Rik van Riel wrote:
> 
> It gets more fun.  It looks like the anon_vma is only
> allocated through anon_vma_alloc() and only handled
> by the functions in rmap.c
> 
> By themselves, all of those functions look alright.

Yes. Very trivially so, in fact.

> However, I think I may have found a possible bug in
> the interplay between anon_vma_prepare() and vma_adjust(),
> across several mprotect invocations.
> 
> Let me explain what I think may be going on in small
> steps, since it is quite subtle (assuming I am right).

Sounds at least possible. Way more likely than any of the "trivially 
obvious" code being buggy, or the SLUB layer suddenly having a serious bug 
that only the new user could trigger.

That said, the code that _really_ confuses me is the stuff that uses 
"anon_vma_clone()". Could you please also explain the code flow of 
vma_adjust() to mere mortals, please?

I suspect Borislav is sleeping. But at least we have a patch for him to 
test when he wakes up ;)

		Linus

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-07  0:10                                     ` Linus Torvalds
@ 2010-04-07  1:18                                       ` Rik van Riel
  2010-04-07  7:22                                         ` Borislav Petkov
  2010-04-07 10:09                                       ` Pekka Enberg
  1 sibling, 1 reply; 242+ messages in thread
From: Rik van Riel @ 2010-04-07  1:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Andrew Morton, Minchan Kim, KOSAKI Motohiro,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson

On 04/06/2010 08:10 PM, Linus Torvalds wrote:

> That said, the code that _really_ confuses me is the stuff that uses
> "anon_vma_clone()". Could you please also explain the code flow of
> vma_adjust() to mere mortals, please?

That's easier said than done.  I spent 3 days with pen and paper,
going over that code before I made the anon_vma changes, first
verifying that the code is indeed correct and then figuring out
how I could make the anon_vma changes safely.

I am not happy with the complexity of the code around vma_adjust,
but could not find a way to simplify it and still keep merging
VMAs the way we do.

My largest change to vma_adjust was moving some code closer to
the beginning of the function, so I could bail out if the
allocation failed, without making change to the vma...

> I suspect Borislav is sleeping. But at least we have a patch for him to
> test when he wakes up ;)

I am looking forward to the test results.



* Re: [PATCH] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-06 23:54                                       ` [PATCH] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA Rik van Riel
@ 2010-04-07  7:00                                         ` KOSAKI Motohiro
  2010-04-07 14:48                                           ` Rik van Riel
  2010-04-07 14:54                                           ` [PATCH -v2] " Rik van Riel
  0 siblings, 2 replies; 242+ messages in thread
From: KOSAKI Motohiro @ 2010-04-07  7:00 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kosaki.motohiro, Linus Torvalds, Borislav Petkov, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

> When a new VMA has a mergeable anon_vma with a neighboring VMA,
> make sure all of the neighbor's old anon_vma structs are also
> linked in.
> 
> This is necessary because at some point the VMAs could get merged,
> and we want to ensure no anon_vma structs get freed prematurely,
> while the system still has anonymous pages that belong to those
> structs.

Ahhhh, I'm ashamed of myself. Sure, the neighbor vma might have lots of avcs ;-)
A few comments are below.

> 
> Reported-by: Borislav Petkov <bp@alien8.de>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> 
> --- 
>  include/linux/mm.h |    2 +-
>  mm/mmap.c          |    6 +++---
>  mm/rmap.c          |   20 +++++++++++++-------
>  3 files changed, 17 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e70f21b..90ac50e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1228,7 +1228,7 @@ extern struct vm_area_struct *vma_merge(struct mm_struct *,
>  	struct vm_area_struct *prev, unsigned long addr, unsigned long end,
>  	unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
>  	struct mempolicy *);
> -extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
> +extern struct vm_area_struct *find_mergeable_anon_vma(struct vm_area_struct *);
>  extern int split_vma(struct mm_struct *,
>  	struct vm_area_struct *, unsigned long addr, int new_below);
>  extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 75557c6..bf0600c 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -832,7 +832,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
>   * anon_vmas being allocated, preventing vma merge in subsequent
>   * mprotect.
>   */
> -struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
> +struct vm_area_struct *find_mergeable_anon_vma(struct vm_area_struct *vma)
>  {
>  	struct vm_area_struct *near;
>  	unsigned long vm_flags;
> @@ -855,7 +855,7 @@ struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
>  			can_vma_merge_before(near, vm_flags,
>  				NULL, vma->vm_file, vma->vm_pgoff +
>  				((vma->vm_end - vma->vm_start) >> PAGE_SHIFT)))
> -		return near->anon_vma;
> +		return near;
>  try_prev:
>  	/*
>  	 * It is potentially slow to have to call find_vma_prev here.
> @@ -875,7 +875,7 @@ try_prev:
>    			mpol_equal(vma_policy(near), vma_policy(vma)) &&
>  			can_vma_merge_after(near, vm_flags,
>  				NULL, vma->vm_file, vma->vm_pgoff))
> -		return near->anon_vma;
> +		return near;
>  none:
>  	/*
>  	 * There's no absolute need to look only at touching neighbours:
> diff --git a/mm/rmap.c b/mm/rmap.c
> index eaa7a09..60616db 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -119,20 +119,26 @@ int anon_vma_prepare(struct vm_area_struct *vma)
>  	might_sleep();
>  	if (unlikely(!anon_vma)) {
>  		struct mm_struct *mm = vma->vm_mm;
> +		struct vm_area_struct *merge_vma;
>  		struct anon_vma *allocated;
>  
> +		merge_vma = find_mergeable_anon_vma(vma);
> +		if (merge_vma) {
> +			if (anon_vma_clone(vma, merge_vma))
> +				goto out_enomem;
> +			return 0;
> +		}
> +

Hmm, I am probably being dense.
I am also confused by this locking rule, just as Linus said.

After this patch, the new locking order is:

	down_read(mmap_sem)
		anon_vma_clone(vma, merge_vma)
			list_add(&avc->same_vma, &vma->anon_vma_chain);
			spin_lock(&anon_vma->lock);
			list_add_tail(&avc->same_anon_vma, &anon_vma->head);
			spin_unlock(&anon_vma->lock);
		spin_lock(&anon_vma->lock);
		spin_lock(&mm->page_table_lock);

So, why can the mmap_sem read lock protect vma->anon_vma_chain?
Another thread seems to be able to change the avc list concurrently and freely.

Plus, why don't we need a "vma->anon_vma = merge_vma->anon_vma" assignment?
If vma->anon_vma stays NULL, I think anon_vma_prepare() will call anon_vma_clone()
multiple times.


>  		avc = anon_vma_chain_alloc();
>  		if (!avc)
>  			goto out_enomem;
>  
> -		anon_vma = find_mergeable_anon_vma(vma);
>  		allocated = NULL;
> -		if (!anon_vma) {
> -			anon_vma = anon_vma_alloc();
> -			if (unlikely(!anon_vma))
> -				goto out_enomem_free_avc;
> -			allocated = anon_vma;
> -		}
> +		anon_vma = anon_vma_alloc();
> +		if (unlikely(!anon_vma))
> +			goto out_enomem_free_avc;
> +		allocated = anon_vma;
> +
>  		spin_lock(&anon_vma->lock);
>  
>  		/* page_table_lock to protect against threads */





* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-07  1:18                                       ` Rik van Riel
@ 2010-04-07  7:22                                         ` Borislav Petkov
  0 siblings, 0 replies; 242+ messages in thread
From: Borislav Petkov @ 2010-04-07  7:22 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Andrew Morton, Minchan Kim, KOSAKI Motohiro,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson

From: Rik van Riel <riel@redhat.com>
Date: Tue, Apr 06, 2010 at 09:18:28PM -0400

Hi Rik,

I think your patch needs a bit more baking, see below :)

> >I suspect Borislav is sleeping. But at least we have a patch for him to
> >test when he wakes up ;)
> 
> I am looking forward to the test results.

This happens when starting X, I haven't even started hibernating.

  [By the way, further testing will have to wait till tonight since I
   have a job, you know :) ]

Also, mm/rmap.c:745 is

	BUG_ON(!anon_vma);

in __page_set_anon_rmap().

---
[   43.142371] ------------[ cut here ]------------
[   43.142411] kernel BUG at mm/rmap.c:745!
[   43.142436] invalid opcode: 0000 [#1] PREEMPT SMP 
[   43.142514] last sysfs file: /sys/devices/virtual/vtconsole/vtcon0/uevent
[   43.142537] CPU 0 
[   43.142559] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 edac_core serial_core k10temp ohci_hcd pcspkr
[   43.142997] 
[   43.143012] Pid: 1940, comm: console-kit-dae Not tainted 2.6.34-rc3-00289-gae1ed76 #5 M3A78 PRO/System Product Name
[   43.143012] RIP: 0010:[<ffffffff810c08e7>]  [<ffffffff810c08e7>] page_add_new_anon_rmap+0x3b/0x89
[   43.143012] RSP: 0000:ffff88022c019da8  EFLAGS: 00010246
[   43.143012] RAX: 0000000000000000 RBX: ffffea000774ff78 RCX: 000000002ce900f4
[   43.143012] RDX: ffff88000a1d5dc8 RSI: 0000000000000007 RDI: ffffffff816e8740
[   43.143012] RBP: ffff88022c019dc8 R08: 00007f29e3cfd928 R09: 000000000062c318
[   43.143012] R10: 0000000000000000 R11: 0000000000000002 R12: ffff88022bbad960
[   43.143012] R13: 00007f29e3cfd928 R14: 00007f29e3cfd928 R15: 80000002216d9067
[   43.143012] FS:  00007f29e3d0f790(0000) GS:ffff88000a000000(0000) knlGS:0000000000000000
[   43.143012] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   43.143012] CR2: 00007f29e3cfd928 CR3: 000000022dfd3000 CR4: 00000000000006f0
[   43.143012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   43.143012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   43.143012] Process console-kit-dae (pid: 1940, threadinfo ffff88022c018000, task ffff88022ce90000)
[   43.143012] Stack:
[   43.143012]  ffffffff810b8802 ffff88022bbad960 ffff88022ea3c600 ffff88022bb6d7e8
[   43.143012] <0> ffff88022c019e48 ffffffff810b8823 ffff88022ea3c6b8 0000000000000246
[   43.143012] <0> ffffea000774ff78 0000000000000001 00000001e3cfd928 ffff88022fdb58f0
[   43.143012] Call Trace:
[   43.143012]  [<ffffffff810b8802>] ? handle_mm_fault+0x2af/0x64e
[   43.143012]  [<ffffffff810b8823>] handle_mm_fault+0x2d0/0x64e
[   43.143012]  [<ffffffff8101f392>] do_page_fault+0x30b/0x32d
[   43.143012]  [<ffffffff810615ce>] ? put_lock_stats+0xe/0x27
[   43.143012]  [<ffffffff81062a55>] ? lock_release_holdtime+0x104/0x109
[   43.143012]  [<ffffffff813f93e3>] ? error_sti+0x5/0x6
[   43.143012]  [<ffffffff813f7de2>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[   43.143012]  [<ffffffff813f91ff>] page_fault+0x1f/0x30
[   43.143012] Code: 00 00 48 89 fb 49 89 f4 49 89 d5 f0 80 4f 02 10 be 07 00 00 00 c7 47 0c 00 00 00 00 e8 c5 30 ff ff 49 8b 44 24 78 48 85 c0 75 04 <0f> 0b eb fe 48 ff c0 4c 89 e6 48 89 df 48 89 43 18 4d 2b 6c 24 
[   43.143012] RIP  [<ffffffff810c08e7>] page_add_new_anon_rmap+0x3b/0x89
[   43.143012]  RSP <ffff88022c019da8>
[   43.145276] ---[ end trace d6305f6e826dbd53 ]---
[   43.145314] note: console-kit-dae[1940] exited with preempt_count 1
[   73.644201] ------------[ cut here ]------------
[   73.644218] kernel BUG at mm/rmap.c:745!
[   73.644226] invalid opcode: 0000 [#2] PREEMPT SMP 
[   73.644266] last sysfs file: /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq
[   73.644278] CPU 0 
[   73.644287] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 edac_core serial_core k10temp ohci_hcd pcspkr
[   73.644509] 
[   73.644520] Pid: 2018, comm: iceowl-bin Tainted: G      D    2.6.34-rc3-00289-gae1ed76 #5 M3A78 PRO/System Product Name
[   73.644534] RIP: 0010:[<ffffffff810c08e7>]  [<ffffffff810c08e7>] page_add_new_anon_rmap+0x3b/0x89
[   73.644553] RSP: 0000:ffff88022cd37da8  EFLAGS: 00010246
[   73.644562] RAX: 0000000000000000 RBX: ffffea000764dfa8 RCX: 0000000000000002
[   73.644572] RDX: ffff88000a1d5dc8 RSI: 0000000000000007 RDI: ffffffff816e8740
[   73.644589] RBP: ffff88022cd37dc8 R08: 00007f2ce0aab928 R09: 0000000000000000
[   73.644603] R10: 0000000000000000 R11: 000000000011da32 R12: ffff88022d5894b0
[   73.644615] R13: 00007f2ce0aab928 R14: 00007f2ce0aab928 R15: 800000021cd23067
[   73.644628] FS:  00007f2cee88b7b0(0000) GS:ffff88000a000000(0000) knlGS:0000000000000000
[   73.644639] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   73.644652] CR2: 00007f2ce0aab928 CR3: 000000022b1b5000 CR4: 00000000000006f0
[   73.644664] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   73.644675] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   73.644690] Process iceowl-bin (pid: 2018, threadinfo ffff88022cd36000, task ffff88022a74a5c0)
[   73.644701] Stack:
[   73.644708]  ffffffff810b8802 ffff88022d5894b0 ffff88022ce41e00 ffff88022d4b0558
[   73.644745] <0> ffff88022cd37e48 ffffffff810b8823 ffff88022ce41eb8 0000000000000246
[   73.644801] <0> ffffea000764dfa8 0000000000000001 00000001e0aab928 ffff88022c0a4828
[   73.644862] Call Trace:
[   73.644874]  [<ffffffff810b8802>] ? handle_mm_fault+0x2af/0x64e
[   73.644885]  [<ffffffff810b8823>] handle_mm_fault+0x2d0/0x64e
[   73.644895]  [<ffffffff8101f392>] do_page_fault+0x30b/0x32d
[   73.644909]  [<ffffffff810be3c2>] ? do_mmap_pgoff+0x290/0x2f3
[   73.644921]  [<ffffffff813f93e3>] ? error_sti+0x5/0x6
[   73.644932]  [<ffffffff81062b97>] ? trace_hardirqs_off_caller+0x1f/0xa9
[   73.644943]  [<ffffffff813f7de2>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[   73.644952]  [<ffffffff813f91ff>] page_fault+0x1f/0x30
[   73.644963] Code: 00 00 48 89 fb 49 89 f4 49 89 d5 f0 80 4f 02 10 be 07 00 00 00 c7 47 0c 00 00 00 00 e8 c5 30 ff ff 49 8b 44 24 78 48 85 c0 75 04 <0f> 0b eb fe 48 ff c0 4c 89 e6 48 89 df 48 89 43 18 4d 2b 6c 24 
[   73.645001] RIP  [<ffffffff810c08e7>] page_add_new_anon_rmap+0x3b/0x89
[   73.645001]  RSP <ffff88022cd37da8>
[   73.645610] ---[ end trace d6305f6e826dbd54 ]---
[   73.645621] note: iceowl-bin[2018] exited with preempt_count 1
[   77.562222] SysRq : HELP : loglevel(0-9) reBoot Crash show-all-locks(D) terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) thaw-filesystems(J) saK show-backtrace-all-active-cpus(L) show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z) 
[   78.014120] SysRq : Emergency Sync
[   78.016864] Emergency Sync complete
[   78.585045] SysRq : Emergency Remount R/O
[   78.663367] Emergency Remount complete
[   79.098126] SysRq : Resetting

-- 
Regards/Gruss,
Boris.


* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 23:27                                     ` Linus Torvalds
  2010-04-06 23:54                                       ` [PATCH] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA Rik van Riel
@ 2010-04-07  7:29                                       ` Borislav Petkov
  2010-04-07 14:05                                       ` Paulo Marques
  2 siblings, 0 replies; 242+ messages in thread
From: Borislav Petkov @ 2010-04-07  7:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, Apr 06, 2010 at 04:27:42PM -0700

> No, you're mis-reading the asm. It's again the first iteration, and the 
> code above it is again the end of the loop. And %rax is once more a kernel 
> pointer, not the return value of 'page_referenced_one()'. 
> 
> So it once more is 'anon_vma->head.next' that is crap, but now it's not 
> NULL, it's that very odd 0x002e2e2e002e2e2e pattern (the %r13 has had 0x20 
> subtracted from it, so that LSB of "0x0e" is actually _also_ a 0x2e).

No, maybe I expressed myself wrong (it was late an' all) - I was
basically trying to confirm your assessment that anon_vma->head.next
is crap but the code had changed since I had added the debugging 'if
(!anon_vma->head.next)' and that was the value that was already in %r13
before iterating over the list chain.

Yeah, just a minor nitpick and not that it matters. Nevermind though,
we're on the same page.

-- 
Regards/Gruss,
Boris.


* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 18:28                       ` Linus Torvalds
  2010-04-06 19:03                         ` Andrew Morton
@ 2010-04-07  8:36                         ` Peter Zijlstra
  2010-04-07  9:16                           ` Johannes Weiner
                                             ` (2 more replies)
  1 sibling, 3 replies; 242+ messages in thread
From: Peter Zijlstra @ 2010-04-07  8:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Minchan Kim, KOSAKI Motohiro, Borislav Petkov,
	Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins

On Tue, 2010-04-06 at 11:28 -0700, Linus Torvalds wrote:
> Just as an example of the kind of code that makes me worry:
> 
>         void unlink_anon_vmas(struct vm_area_struct *vma)
>         {
>                 struct anon_vma_chain *avc, *next;
>                         
>                 /* Unlink each anon_vma chained to the VMA. */
>                 list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
>                         anon_vma_unlink(avc);
>                         list_del(&avc->same_vma);
>                         anon_vma_chain_free(avc);
>                 }
>         }
> 
> Now, think about what happens for the *last* entry in that avc chain. It 
> will call that "anon_vma_unlink()" thing, which will delete perhaps the 
> last entry in the "same_anon_vma" one, and then it does
> 
>         if (empty)
>                 anon_vma_free(anon_vma);
> 
> *before* unlink_anon_vma's has actually does that
> 
>         list_del(&avc->same_vma);
> 
> and what we essentially have is a stale anon_vma_chain entry that still 
> exists on that same_vma list, and points to an anon_vma that already got 
> deleted.
> 
> Does it matter? I really can't see that it does. 

I think it does: the anon_vma thing has an RCU-destroyed slab, but that
doesn't mean the anon_vma object itself is RCU-delayed. The moment we
free it, it can be re-used. So the above use after free is a bug.


* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 15:55           ` Linus Torvalds
  2010-04-06 16:23             ` Minchan Kim
@ 2010-04-07  8:37             ` Peter Zijlstra
  1 sibling, 0 replies; 242+ messages in thread
From: Peter Zijlstra @ 2010-04-07  8:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Minchan Kim, Rik van Riel, KOSAKI Motohiro, Borislav Petkov,
	Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins

On Tue, 2010-04-06 at 08:55 -0700, Linus Torvalds wrote:
> I do wonder if "page_lock_anon_vma()" should check the whole 
> "page_mapped()" case _after_ taking the anon_vma lock. Because if the race 
> happens, we're following a anon_vma list that has nothing to do with that 
> page (it's stilla _valid_ list, since we locked the anon_vma, but will it 
> be ok?)
> 
> IOW, what is it that really keeps the anon_vma list reliable _and_ 
> relevant wrt the page? We know we may get a stale anon_vma, are we ok if 
> that anon_vma list doesn't actually have anything to do with the page any 
> more? 

When doing the whole "make i_mmap_lock/anon_vma->lock a mutex" thing last
week I ran into the same issue, and it's on my todo list to find out wth
is happening there.

So yes I think we should move that validation check inside
page_lock_anon_vma().

I'll cook up a patch once I'm done staring at the various funny arch
mmu_gather implementations.


* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 20:02                             ` Linus Torvalds
  2010-04-06 20:46                               ` Steinar H. Gunderson
  2010-04-06 20:51                               ` Borislav Petkov
@ 2010-04-07  8:41                               ` Peter Zijlstra
  2 siblings, 0 replies; 242+ messages in thread
From: Peter Zijlstra @ 2010-04-07  8:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Andrew Morton, Rik van Riel, Minchan Kim,
	KOSAKI Motohiro, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On Tue, 2010-04-06 at 13:02 -0700, Linus Torvalds wrote:
>  - Related to the above: perhaps the RCU freeing isn't working, or 
>    slub/slab/slob ends up reusing the allocations for something else than 
>    anonvma's, so together with the race _and_ an unlucky re-use, you get 
>    some odd crud.
> 
>    I haven't looked at the kernel config files: do they perhaps share the 
>    same (odd?) SLUB/SLAB/SLOB config? 

Right, so anon_vma uses SLAB_DESTROY_BY_RCU and, as the huge comment in
rmap.c explains, that doesn't mean the objects themselves get RCU grace
period delays in freeing; only the slab that backs these objects does.

So the moment you do kmem_cache_free() on the anon_vma it can be re-used
for another allocation. The only guarantee given by RCU is that the
backing storage doesn't go away and hence you can 'safely' deref
pointers; you still very much have to revalidate that you got the object
you were looking for.


* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-07  8:36                         ` Peter Zijlstra
@ 2010-04-07  9:16                           ` Johannes Weiner
  2010-04-07  9:37                             ` Peter Zijlstra
  2010-04-07 14:12                           ` Rik van Riel
  2010-04-07 15:46                           ` Linus Torvalds
  2 siblings, 1 reply; 242+ messages in thread
From: Johannes Weiner @ 2010-04-07  9:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Borislav Petkov, Andrew Morton, Linux Kernel Mailing List,
	Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins

On Wed, Apr 07, 2010 at 10:36:43AM +0200, Peter Zijlstra wrote:
> On Tue, 2010-04-06 at 11:28 -0700, Linus Torvalds wrote:
> > Just as an example of the kind of code that makes me worry:
> > 
> >         void unlink_anon_vmas(struct vm_area_struct *vma)
> >         {
> >                 struct anon_vma_chain *avc, *next;
> >                         
> >                 /* Unlink each anon_vma chained to the VMA. */
> >                 list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
> >                         anon_vma_unlink(avc);
> >                         list_del(&avc->same_vma);
> >                         anon_vma_chain_free(avc);
> >                 }
> >         }
> > 
> > Now, think about what happens for the *last* entry in that avc chain. It 
> > will call that "anon_vma_unlink()" thing, which will delete perhaps the 
> > last entry in the "same_anon_vma" one, and then it does
> > 
> >         if (empty)
> >                 anon_vma_free(anon_vma);
> > 
> > *before* unlink_anon_vma's has actually does that
> > 
> >         list_del(&avc->same_vma);
> > 
> > and what we essentially have is a stale anon_vma_chain entry that still 
> > exists on that same_vma list, and points to an anon_vma that already got 
> > deleted.
> > 
> > Does it matter? I really can't see that it does. 
> 
> I think it does, the anon_vma thing has an RCU destroyed slab, but that
> doesn't mean the anon_vma object itself is rcu delayed. The moment we
> free it it can be re-used. So the above use after free is a bug.

It frees avc->anon_vma, not avc.  So the sequence is

	free(avc->anon_vma) in anon_vma_unlink()
	list_del(&avc->same_vma) in unlink_anon_vmas()

It's not a use-after free.  A problem would be if somebody should find the
avc through this list (it is the vma->anon_vma_chain list) when its anon_vma
pointer is invalid.

I don't think this can happen, however.  Both the unlinking and the looking
at the list happen under vma->vm_mm's mmap_sem held for writing.


* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-07  9:16                           ` Johannes Weiner
@ 2010-04-07  9:37                             ` Peter Zijlstra
  0 siblings, 0 replies; 242+ messages in thread
From: Peter Zijlstra @ 2010-04-07  9:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linus Torvalds, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Borislav Petkov, Andrew Morton, Linux Kernel Mailing List,
	Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins

On Wed, 2010-04-07 at 11:16 +0200, Johannes Weiner wrote:
> On Wed, Apr 07, 2010 at 10:36:43AM +0200, Peter Zijlstra wrote:
> > On Tue, 2010-04-06 at 11:28 -0700, Linus Torvalds wrote:
> > > Just as an example of the kind of code that makes me worry:
> > > 
> > >         void unlink_anon_vmas(struct vm_area_struct *vma)
> > >         {
> > >                 struct anon_vma_chain *avc, *next;
> > >                         
> > >                 /* Unlink each anon_vma chained to the VMA. */
> > >                 list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
> > >                         anon_vma_unlink(avc);
> > >                         list_del(&avc->same_vma);
> > >                         anon_vma_chain_free(avc);
> > >                 }
> > >         }
> > > 
> > > Now, think about what happens for the *last* entry in that avc chain. It 
> > > will call that "anon_vma_unlink()" thing, which will delete perhaps the 
> > > last entry in the "same_anon_vma" one, and then it does
> > > 
> > >         if (empty)
> > >                 anon_vma_free(anon_vma);
> > > 
> > > *before* unlink_anon_vma's has actually does that
> > > 
> > >         list_del(&avc->same_vma);
> > > 
> > > and what we essentially have is a stale anon_vma_chain entry that still 
> > > exists on that same_vma list, and points to an anon_vma that already got 
> > > deleted.
> > > 
> > > Does it matter? I really can't see that it does. 
> > 
> > I think it does, the anon_vma thing has an RCU destroyed slab, but that
> > doesn't mean the anon_vma object itself is rcu delayed. The moment we
> > free it it can be re-used. So the above use after free is a bug.
> 
> It frees avc->anon_vma, not avc.

Sure, freeing avc does not involve RCU in any way.

>   So the sequence is
> 
> 	free(avc->anon_vma) in anon_vma_unlink()
> 	list_del(&avc->same_vma) in unlink_anon_vmas()
> 
> It's not a use-after free.  A problem would be if somebody should find the
> avc through this list (it is the vma->anon_vma_chain list) when its anon_vma
> pointer is invalid.
> 
> I don't think this can happen, however.  Both the unlinking and the looking
> at the list happen under vma->vm_mm's mmap_sem held for writing.

What I was worried about was it freeing anon_vma and then still having
the avc on list. But I guess that cannot happen because it only frees if
it's actually empty.




* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux  2.6.34-rc3)
  2010-04-07  0:10                                     ` Linus Torvalds
  2010-04-07  1:18                                       ` Rik van Riel
@ 2010-04-07 10:09                                       ` Pekka Enberg
  2010-04-07 10:12                                         ` KOSAKI Motohiro
  1 sibling, 1 reply; 242+ messages in thread
From: Pekka Enberg @ 2010-04-07 10:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Borislav Petkov, Andrew Morton, Minchan Kim,
	KOSAKI Motohiro, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson,
	Christoph Lameter, Tejun Heo

Hi Linus,

On Wed, Apr 7, 2010 at 3:10 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> Sounds at least possible. Way more likely than any of the "trivially
> obvious" code being buggy, or the SLUB layer suddenly having a serious bug
> that only the new user could trigger.

I haven't followed the discussion at all but if someone wants to
investigate that angle more, the most likely suspects are the recent
per-cpu changes. That said, I'd expect the problem to be more
widespread if SLUB is to blame here.

                        Pekka


* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux  2.6.34-rc3)
  2010-04-07 10:09                                       ` Pekka Enberg
@ 2010-04-07 10:12                                         ` KOSAKI Motohiro
  0 siblings, 0 replies; 242+ messages in thread
From: KOSAKI Motohiro @ 2010-04-07 10:12 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: kosaki.motohiro, Linus Torvalds, Rik van Riel, Borislav Petkov,
	Andrew Morton, Minchan Kim, Linux Kernel Mailing List,
	Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins,
	sgunderson, Christoph Lameter, Tejun Heo

> Hi Linus,
> 
> On Wed, Apr 7, 2010 at 3:10 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > Sounds at least possible. Way more likely than any of the "trivially
> > obvious" code being buggy, or the SLUB layer suddenly having a serious bug
> > that only the new user could trigger.
> 
> I haven't followed the discussion at all but if someone wants to
> investigate that angle more, the most likely suspect are the recent
> per-cpu changes. That said, I'd expect the problem to be more
> widespread if SLUB is to blame here.

Nope. We no longer suspect SLUB or per-cpu. Rik found the bug in his patch.

thanks.





* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 23:27                                     ` Linus Torvalds
  2010-04-06 23:54                                       ` [PATCH] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA Rik van Riel
  2010-04-07  7:29                                       ` Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) Borislav Petkov
@ 2010-04-07 14:05                                       ` Paulo Marques
  2010-04-07 14:13                                         ` Borislav Petkov
  2 siblings, 1 reply; 242+ messages in thread
From: Paulo Marques @ 2010-04-07 14:05 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Borislav Petkov, Andrew Morton, Rik van Riel, Minchan Kim,
	KOSAKI Motohiro, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli,
	Hugh Dickins, sgunderson

Linus Torvalds wrote:
> [...]
> So it once more is 'anon_vma->head.next' that is crap, but now it's not 
> NULL, it's that very odd 0x002e2e2e002e2e2e pattern (the %r13 has had 0x20 
> subtracted from it, so that LSB of "0x0e" is actually _also_ a 0x2e).
> 
> What does '0x2e' mean? It's ASCII '.', but that doesn't really mean 
> anything either.

Just a wild shot in the dark: it can be a couple of gray pixels with
intensity 0x2e at some 32 bits per pixel mode. I say this because of the
zero bytes there and someone mentioning seeing the problem when starting X.

-- 
Paulo Marques - www.grupopie.com

"Don't worry, you'll be fine; I saw it work in a cartoon once..."


* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-07  8:36                         ` Peter Zijlstra
  2010-04-07  9:16                           ` Johannes Weiner
@ 2010-04-07 14:12                           ` Rik van Riel
  2010-04-07 15:46                           ` Linus Torvalds
  2 siblings, 0 replies; 242+ messages in thread
From: Rik van Riel @ 2010-04-07 14:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Minchan Kim, KOSAKI Motohiro, Borislav Petkov,
	Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins

On 04/07/2010 04:36 AM, Peter Zijlstra wrote:
> On Tue, 2010-04-06 at 11:28 -0700, Linus Torvalds wrote:

>>          if (empty)
>>                  anon_vma_free(anon_vma);
>>
>> *before* unlink_anon_vmas() has actually done that
>>
>>          list_del(&avc->same_vma);
>>
>> and what we essentially have is a stale anon_vma_chain entry that still
>> exists on that same_vma list, and points to an anon_vma that already got
>> deleted.
>>
>> Does it matter? I really can't see that it does.
>
> I think it does, the anon_vma thing has an RCU destroyed slab, but that
> doesn't mean the anon_vma object itself is rcu delayed. The moment we
> free it it can be re-used. So the above use after free is a bug.

Peter, the avc is an anon_vma_chain, which is a different
object than the anon_vma itself.  There is no use after free
of an anon_vma object in unlink_anon_vmas + anon_vma_unlink.

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-07 14:05                                       ` Paulo Marques
@ 2010-04-07 14:13                                         ` Borislav Petkov
  0 siblings, 0 replies; 242+ messages in thread
From: Borislav Petkov @ 2010-04-07 14:13 UTC (permalink / raw)
  To: Paulo Marques
  Cc: Linux Kernel Mailing List, Borislav Petkov, Andrew Morton,
	Rik van Riel, Minchan Kim, KOSAKI Motohiro, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

From: Paulo Marques <pmarques@grupopie.com>
Date: Wed, Apr 07, 2010 at 03:05:50PM +0100

> Linus Torvalds wrote:
> > [...]
> > So it once more is 'anon_vma->head.next' that is crap, but now it's not 
> > NULL, it's that very odd 0x002e2e2e002e2e2e pattern (the %r13 has had 0x20 
> > subtracted from it, so that LSB of "0x0e" is actually _also_ a 0x2e).
> > 
> > What does '0x2e' mean? It's ASCII '.', but that doesn't really mean 
> > anything either.
> 
> Just a wild shot in the dark: it can be a couple of gray pixels with
> intensity 0x2e at some 32 bits per pixel mode. I say this because of the
> zero bytes there and someone mentioning seeing the problem when starting X.

I don't think those are related: the problem when X was starting happens
with Rik's newest patch and the funny %r13 value happened after enabling
SLUB debugging last night.

Thanks.

-- 
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-07  7:00                                         ` KOSAKI Motohiro
@ 2010-04-07 14:48                                           ` Rik van Riel
  2010-04-07 14:54                                           ` [PATCH -v2] " Rik van Riel
  1 sibling, 0 replies; 242+ messages in thread
From: Rik van Riel @ 2010-04-07 14:48 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linus Torvalds, Borislav Petkov, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

On 04/07/2010 03:00 AM, KOSAKI Motohiro wrote:

> Hmm.. probably I'm a moron.

Someone might be, but it's not you :)

> I'm also confused about this locking rule, the same as Linus said.
>
> After this patch, the new locking order is

> So why can the mmap_sem read lock protect vma->anon_vma_chain?
> Another thread seems to be able to change the avc list concurrently and freely.

You are right, the code needs to take the page_table_lock
around the call to anon_vma_clone, so other threads
get locked out.

This means the locking order has now been inverted,
with the page_table_lock on the outside and the
anon_vma locks on the inside.

I have checked all the other call sites to the
anon_vma code.  The direct callers of anon_vma_clone
and anon_vma_fork already hold the mmap_sem for
write.  The callers of anon_vma_prepare hold the
mmap_sem for read - so excluding other callers of
anon_vma_prepare with the page_table_lock is enough.

mm_take_all_locks has the mmap_sem for write.

There seem to be no other traversals of the same_vma
list, so changing the locking order to have the
page_table_lock on the outside of the anon_vma locks
works.

> Plus, why don't we need the "vma->anon_vma = merge_vma->anon_vma" assignment?
> If vma->anon_vma stays NULL, I think anon_vma_prepare() will call
> anon_vma_clone() multiple times.

Added in the new version.  See the next email.


^ permalink raw reply	[flat|nested] 242+ messages in thread

* [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-07  7:00                                         ` KOSAKI Motohiro
  2010-04-07 14:48                                           ` Rik van Riel
@ 2010-04-07 14:54                                           ` Rik van Riel
  2010-04-07 15:30                                             ` Linus Torvalds
  2010-04-07 15:55                                             ` Minchan Kim
  1 sibling, 2 replies; 242+ messages in thread
From: Rik van Riel @ 2010-04-07 14:54 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, Linus Torvalds, Borislav Petkov, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

When a new VMA has a mergeable anon_vma with a neighboring VMA,
make sure all of the neighbor's old anon_vma structs are also
linked in.

This is necessary because at some point the VMAs could get merged,
and we want to ensure no anon_vma structs get freed prematurely,
while the system still has anonymous pages that belong to those
structs.

Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Rik van Riel <riel@redhat.com>

--- 
v2:
 - fix the locking issues spotted by Kosaki Motohiro
 - set vma->anon_vma correctly

 include/linux/mm.h |    2 +-
 mm/mmap.c          |    6 +++---
 mm/rmap.c          |   27 ++++++++++++++++++---------
 3 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e70f21b..90ac50e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1228,7 +1228,7 @@ extern struct vm_area_struct *vma_merge(struct mm_struct *,
 	struct vm_area_struct *prev, unsigned long addr, unsigned long end,
 	unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
 	struct mempolicy *);
-extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
+extern struct vm_area_struct *find_mergeable_anon_vma(struct vm_area_struct *);
 extern int split_vma(struct mm_struct *,
 	struct vm_area_struct *, unsigned long addr, int new_below);
 extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
diff --git a/mm/mmap.c b/mm/mmap.c
index 75557c6..bf0600c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -832,7 +832,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
  * anon_vmas being allocated, preventing vma merge in subsequent
  * mprotect.
  */
-struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
+struct vm_area_struct *find_mergeable_anon_vma(struct vm_area_struct *vma)
 {
 	struct vm_area_struct *near;
 	unsigned long vm_flags;
@@ -855,7 +855,7 @@ struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
 			can_vma_merge_before(near, vm_flags,
 				NULL, vma->vm_file, vma->vm_pgoff +
 				((vma->vm_end - vma->vm_start) >> PAGE_SHIFT)))
-		return near->anon_vma;
+		return near;
 try_prev:
 	/*
 	 * It is potentially slow to have to call find_vma_prev here.
@@ -875,7 +875,7 @@ try_prev:
   			mpol_equal(vma_policy(near), vma_policy(vma)) &&
 			can_vma_merge_after(near, vm_flags,
 				NULL, vma->vm_file, vma->vm_pgoff))
-		return near->anon_vma;
+		return near;
 none:
 	/*
 	 * There's no absolute need to look only at touching neighbours:
diff --git a/mm/rmap.c b/mm/rmap.c
index eaa7a09..abe7aa5 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -119,24 +119,33 @@ int anon_vma_prepare(struct vm_area_struct *vma)
 	might_sleep();
 	if (unlikely(!anon_vma)) {
 		struct mm_struct *mm = vma->vm_mm;
+		struct vm_area_struct *merge_vma;
 		struct anon_vma *allocated;
 
+		merge_vma = find_mergeable_anon_vma(vma);
+		if (merge_vma) {
+			int ret;
+			spin_lock(&mm->page_table_lock);
+			ret = anon_vma_clone(vma, merge_vma);
+			if (!ret)
+				vma->anon_vma = merge_vma->anon_vma;
+			spin_unlock(&mm->page_table_lock);
+			return ret;
+		}
+
 		avc = anon_vma_chain_alloc();
 		if (!avc)
 			goto out_enomem;
 
-		anon_vma = find_mergeable_anon_vma(vma);
 		allocated = NULL;
-		if (!anon_vma) {
-			anon_vma = anon_vma_alloc();
-			if (unlikely(!anon_vma))
-				goto out_enomem_free_avc;
-			allocated = anon_vma;
-		}
-		spin_lock(&anon_vma->lock);
+		anon_vma = anon_vma_alloc();
+		if (unlikely(!anon_vma))
+			goto out_enomem_free_avc;
+		allocated = anon_vma;
 
 		/* page_table_lock to protect against threads */
 		spin_lock(&mm->page_table_lock);
+		spin_lock(&anon_vma->lock);
 		if (likely(!vma->anon_vma)) {
 			vma->anon_vma = anon_vma;
 			avc->anon_vma = anon_vma;
@@ -145,9 +154,9 @@ int anon_vma_prepare(struct vm_area_struct *vma)
 			list_add(&avc->same_anon_vma, &anon_vma->head);
 			allocated = NULL;
 		}
+		spin_unlock(&anon_vma->lock);
 		spin_unlock(&mm->page_table_lock);
 
-		spin_unlock(&anon_vma->lock);
 		if (unlikely(allocated)) {
 			anon_vma_free(allocated);
 			anon_vma_chain_free(avc);


^ permalink raw reply related	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-07 14:54                                           ` [PATCH -v2] " Rik van Riel
@ 2010-04-07 15:30                                             ` Linus Torvalds
  2010-04-07 15:52                                               ` Rik van Riel
  2010-04-07 15:55                                             ` Minchan Kim
  1 sibling, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-07 15:30 UTC (permalink / raw)
  To: Rik van Riel
  Cc: KOSAKI Motohiro, Borislav Petkov, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson, hannes



On Wed, 7 Apr 2010, Rik van Riel wrote:
>
>  - fix the locking issues spotted by Kosaki Motohiro

No, they're broken.

And Rik, please explain the locking rather than make even more of these 
kinds of random ad-hoc locking rules.

I've said this now _three_ times, but let me repeat once more:

 - the locking rules for that anon_vma_chain are very unclear. I _think_ 
   you mean for them to be "mmap_sem held for writing, _or_ mmap_sem held 
   for reading and page_table_lock held", but nowhere is that actually 
   documented.

Why is it so hard for you to just admit that? Especially after you 
yourself got it wrong.
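
The rule guessed at above can be written down as a predicate. This is
only a sketch of the rule as stated in this message ("I _think_ you
mean"), not a documented kernel invariant; the enum and the function
name are hypothetical.

```c
#include <stdbool.h>

/* How the caller holds mmap_sem, if at all. */
enum rule_mmap_sem_mode {
	RULE_MMAP_SEM_NONE,
	RULE_MMAP_SEM_READ,
	RULE_MMAP_SEM_WRITE,
};

/*
 * May the caller modify vma->anon_vma_chain?
 *  - mmap_sem held for writing: yes
 *  - mmap_sem held for reading: only with page_table_lock held too
 *  - otherwise: no
 */
static bool rule_may_modify_chain(enum rule_mmap_sem_mode sem,
				  bool page_table_lock_held)
{
	if (sem == RULE_MMAP_SEM_WRITE)
		return true;
	return sem == RULE_MMAP_SEM_READ && page_table_lock_held;
}
```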

> +		merge_vma = find_mergeable_anon_vma(vma);
> +		if (merge_vma) {
> +			int ret;
> +			spin_lock(&mm->page_table_lock);
> +			ret = anon_vma_clone(vma, merge_vma);
> +			if (!ret)
> +				vma->anon_vma = merge_vma->anon_vma;
> +			spin_unlock(&mm->page_table_lock);
> +			return ret;
> +		}

Rik, the above is obviously total crap.

anon_vma_clone() needs to allocate memory, and it does so with GFP_KERNEL. 
You can't do that with a spinlock held.
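
A user-space toy model of that constraint, assuming nothing about the
real kernel API: the "spinlock" is a plain flag, and the stand-in for
a GFP_KERNEL allocation asserts that no such lock is held. All
lockdemo_* names are hypothetical. The safe shape is the one
anon_vma_prepare() already uses for its own allocation: allocate
first, take the lock, re-check, and free the spare if the race was
lost.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct lockdemo_lock {
	int held;
};

/* Greater than zero while any toy spinlock is held. */
static int lockdemo_atomic_depth;

static void lockdemo_spin_lock(struct lockdemo_lock *l)
{
	assert(!l->held);
	l->held = 1;
	lockdemo_atomic_depth++;
}

static void lockdemo_spin_unlock(struct lockdemo_lock *l)
{
	assert(l->held);
	l->held = 0;
	lockdemo_atomic_depth--;
}

/* Stand-in for a sleeping allocation: illegal under a spinlock. */
static void *lockdemo_sleeping_alloc(size_t size)
{
	assert(lockdemo_atomic_depth == 0);
	return calloc(1, size);
}

struct lockdemo_vma {
	struct lockdemo_lock ptl;
	void *anon;
};

/* Returns 0 on success, -1 if the allocation failed. */
static int lockdemo_prepare(struct lockdemo_vma *vma)
{
	void *spare;

	if (vma->anon)
		return 0;

	/* Allocate before taking the lock, never inside it. */
	spare = lockdemo_sleeping_alloc(16);
	if (!spare)
		return -1;

	lockdemo_spin_lock(&vma->ptl);
	if (!vma->anon) {
		vma->anon = spare;
		spare = NULL;
	}
	lockdemo_spin_unlock(&vma->ptl);

	/* Lost the race: drop the spare (free(NULL) is a no-op). */
	free(spare);
	return 0;
}
```

Calling lockdemo_sleeping_alloc() between the lock and unlock calls
would trip the assertion, which is the toy equivalent of the bug in
the quoted hunk.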

		Linus

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-07  8:36                         ` Peter Zijlstra
  2010-04-07  9:16                           ` Johannes Weiner
  2010-04-07 14:12                           ` Rik van Riel
@ 2010-04-07 15:46                           ` Linus Torvalds
  2 siblings, 0 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-07 15:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Minchan Kim, KOSAKI Motohiro, Borislav Petkov,
	Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins



On Wed, 7 Apr 2010, Peter Zijlstra wrote:

> On Tue, 2010-04-06 at 11:28 -0700, Linus Torvalds wrote:
> > Just as an example of the kind of code that makes me worry:
> > 
> >         void unlink_anon_vmas(struct vm_area_struct *vma)
> >         {
> >                 struct anon_vma_chain *avc, *next;
> >                         
> >                 /* Unlink each anon_vma chained to the VMA. */
> >                 list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
> >                         anon_vma_unlink(avc);
> >                         list_del(&avc->same_vma);
> >                         anon_vma_chain_free(avc);
> >                 }
> >         }
> > 
> > Now, think about what happens for the *last* entry in that avc chain. It 
> > will call that "anon_vma_unlink()" thing, which will delete perhaps the 
> > last entry in the "same_anon_vma" one, and then it does
> > 
> >         if (empty)
> >                 anon_vma_free(anon_vma);
> > 
> > *before* unlink_anon_vmas() has actually done that
> > 
> >         list_del(&avc->same_vma);
> > 
> > and what we essentially have is a stale anon_vma_chain entry that still 
> > exists on that same_vma list, and points to an anon_vma that already got 
> > deleted.
> > 
> > Does it matter? I really can't see that it does. 
> 
> I think it does, the anon_vma thing has an RCU destroyed slab, but that
> doesn't mean the anon_vma object itself is rcu delayed. The moment we
> free it it can be re-used. So the above use after free is a bug.

Well, it's not really a "use after free" - it's just that a stale pointer 
still exists in a live data structure that is linked into the list. I 
don't think there is a real bug there, simply because I don't think 
anybody will be accessing that list (we should hopefully have all the 
sufficient mutual exclusion in place).

So I just think it is bad form to potentially free something before we get 
rid of all pointers to it. So to me it's a cleanliness issue: good code 
shouldn't do things like that, and it would be much cleaner to remove the 
AVC entry that has a pointer to the anon_vma _before_ we might be freeing 
the anon_vma.

Maybe I'm just anal.
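
The ordering argued for above, as a user-space sketch with a toy
singly linked chain. It is not the kernel's list.h or the real
unlink_anon_vmas(); every chain_* name is hypothetical. The only point
is the order: take the entry that holds the pointer off the list
first, and only then drop the object it pointed to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct chain_anon_vma {
	int users;		/* chain entries pointing at us */
};

struct chain_entry {
	struct chain_entry *next;
	struct chain_anon_vma *anon;
};

struct chain_vma {
	struct chain_entry *head;
};

static int chain_anon_freed;	/* how many toy anon_vmas were freed */

static struct chain_anon_vma *chain_anon_alloc(void)
{
	return calloc(1, sizeof(struct chain_anon_vma));
}

static int chain_link(struct chain_vma *vma, struct chain_anon_vma *anon)
{
	struct chain_entry *e = malloc(sizeof(*e));

	if (!e)
		return -1;
	e->anon = anon;
	e->next = vma->head;
	vma->head = e;
	anon->users++;
	return 0;
}

static bool chain_references(const struct chain_vma *vma,
			     const struct chain_anon_vma *anon)
{
	const struct chain_entry *e;

	for (e = vma->head; e; e = e->next)
		if (e->anon == anon)
			return true;
	return false;
}

static void chain_unlink_all(struct chain_vma *vma)
{
	while (vma->head) {
		struct chain_entry *e = vma->head;
		struct chain_anon_vma *anon = e->anon;

		/* 1. Remove the entry holding the pointer from the list. */
		vma->head = e->next;
		free(e);

		/* 2. Only now drop, and possibly free, the anon_vma. */
		if (--anon->users == 0) {
			assert(!chain_references(vma, anon));
			free(anon);
			chain_anon_freed++;
		}
	}
}
```

With the two steps swapped, the assertion would fire for the last
entry, which is the stale-pointer window described in this message.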

		Linus

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-07 15:30                                             ` Linus Torvalds
@ 2010-04-07 15:52                                               ` Rik van Riel
  2010-04-07 16:56                                                 ` Linus Torvalds
  0 siblings, 1 reply; 242+ messages in thread
From: Rik van Riel @ 2010-04-07 15:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Borislav Petkov, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

On 04/07/2010 11:30 AM, Linus Torvalds wrote:

> I've said this now _three_ times, but let me repeat once more:
>
>   - the locking rules for that anon_vma_chain are very unclear. I _think_
>     you mean for them to be "mmap_sem held for writing, _or_ mmap_sem held
>     for reading and page_table_lock held", but nowhere is that actually
>     documented.

> Why is it so hard for you to just admit that? Especially after you
> yourself got it wrong.

You are right, the idea was to continue to use the locking that
the anon_vma code was already using, without introducing any
new locking with the anon_vma patches.

However, it has become clear that this is no longer possible,
due to the need to hold a secondary lock across anon_vma_clone,
when we come from a code path that holds the mmap_sem for read.

>> +		merge_vma = find_mergeable_anon_vma(vma);
>> +		if (merge_vma) {
>> +			int ret;
>> +			spin_lock(&mm->page_table_lock);
>> +			ret = anon_vma_clone(vma, merge_vma);
>> +			if (!ret)
>> +				vma->anon_vma = merge_vma->anon_vma;
>> +			spin_unlock(&mm->page_table_lock);
>> +			return ret;
>> +		}
>
> Rik, the above is obviously total crap.
>
> anon_vma_clone() needs to allocate memory, and it does so with GFP_KERNEL.
> You can't do that with a spinlock held.

Looks like we'll either have to introduce a per-mm semaphore for
the same_vma anon_vma chains, or move the complexity of solving
this bug to anon_vma_merge, where we can ensure that the resulting
VMA has the sum of the anon_vmas of each VMA.

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas  of a mergeable VMA
  2010-04-07 14:54                                           ` [PATCH -v2] " Rik van Riel
  2010-04-07 15:30                                             ` Linus Torvalds
@ 2010-04-07 15:55                                             ` Minchan Kim
  1 sibling, 0 replies; 242+ messages in thread
From: Minchan Kim @ 2010-04-07 15:55 UTC (permalink / raw)
  To: Rik van Riel
  Cc: KOSAKI Motohiro, Linus Torvalds, Borislav Petkov, Andrew Morton,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

Hi, Rik.

On Wed, Apr 7, 2010 at 11:54 PM, Rik van Riel <riel@redhat.com> wrote:
> When a new VMA has a mergeable anon_vma with a neighboring VMA,
> make sure all of the neighbor's old anon_vma structs are also
> linked in.
>
> This is necessary because at some point the VMAs could get merged,
> and we want to ensure no anon_vma structs get freed prematurely,
> while the system still has anonymous pages that belong to those
> structs.
>
> Reported-by: Borislav Petkov <bp@alien8.de>
> Signed-off-by: Rik van Riel <riel@redhat.com>

At last, you might have found the culprit.

AFAIU your description, don't we have to care about the vma_merge case, too?
Sorry if it is a dumb question.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-07 15:52                                               ` Rik van Riel
@ 2010-04-07 16:56                                                 ` Linus Torvalds
  2010-04-07 21:19                                                   ` Linus Torvalds
  0 siblings, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-07 16:56 UTC (permalink / raw)
  To: Rik van Riel
  Cc: KOSAKI Motohiro, Borislav Petkov, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson, hannes



On Wed, 7 Apr 2010, Rik van Riel wrote:
> 
> You are right, the idea was to continue to use the locking that
> the anon_vma code was already using, without introducing any
> new locking with the anon_vma patches.
> 
> However, it has become clear that this is no longer possible,
> due to the need to hold a secondary lock across anon_vma_clone,
> when we come from a code path that holds the mmap_sem for read.

I do wonder if we could possibly simplify this a _lot_ by just requiring 
that the anon_vma gets allocated at vma creation time (ie mmap), rather 
than doing it on-demand when we actually do the page fault.

That would make all of this crap happen under mmap_sem held for writing, 
and it would simplify the faulting code (which is the much more critical 
code) a lot.

And it would make all your locking problems go away. Now all anon_vma code 
really _would_ run with mmap_sem held exclusively, without any races.

When I tried to do a "fill in multiple page table entries in one go" 
patch, that annoying anon_vma issue was a problem as well. Allocating the 
anon_vma up-front would have simplified that code too.

I can't imagine that we ever really have mappings without an anon_vma in 
practice _anyway_, so why delay the allocation until page fault time?

Maybe I'm missing something subtle. 

			Linus

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-07 16:56                                                 ` Linus Torvalds
@ 2010-04-07 21:19                                                   ` Linus Torvalds
  2010-04-07 21:52                                                     ` Rik van Riel
  0 siblings, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-07 21:19 UTC (permalink / raw)
  To: Rik van Riel
  Cc: KOSAKI Motohiro, Borislav Petkov, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson, hannes



On Wed, 7 Apr 2010, Linus Torvalds wrote:
> 
> I do wonder if we could possibly simplify this a _lot_ by just requiring 
> that the anon_vma gets allocated at vma creation time (ie mmap), rather 
> than doing it on-demand when we actually do the page fault.
> 
> That would make all of this crap happen under mmap_sem held for writing, 
> and it would simplify the faulting code (which is the much more critical 
> code) a lot.

Here is a patch that boots for me (but has had _zero_ serious testing: 
caveat emptor etc etc).

It basically moves "anon_vma_prepare()" to be called in vma_link and in 
__insert_vm_struct() - which I _think_ should cover all normal vma 
creation events. I did a "WARN_ONCE(!vma->anon_vma)" just to check, I 
haven't triggered one yet.

Now, this clearly will create anon_vma's that may never get used at all, 
ie for things like shared mappings etc that never have anonymous memory 
associated with them. But that structure is pretty small, so I don't find 
it in myself to care too deeply.

And with this, all the anon_vma games should all happen with mmap_sem held
for writing, which should hopefully simplify things a lot. Rik, can you 
use this to make a new version of your fixing patch?

Comments?

		Linus

---
 mm/memory.c |   10 +---------
 mm/mmap.c   |   17 ++++-------------
 2 files changed, 5 insertions(+), 22 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 833952d..0abefd8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2223,9 +2223,6 @@ reuse:
 gotten:
 	pte_unmap_unlock(page_table, ptl);
 
-	if (unlikely(anon_vma_prepare(vma)))
-		goto oom;
-
 	if (is_zero_pfn(pte_pfn(orig_pte))) {
 		new_page = alloc_zeroed_user_highpage_movable(vma, address);
 		if (!new_page)
@@ -2766,8 +2763,6 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	/* Allocate our own private page. */
 	pte_unmap(page_table);
 
-	if (unlikely(anon_vma_prepare(vma)))
-		goto oom;
 	page = alloc_zeroed_user_highpage_movable(vma, address);
 	if (!page)
 		goto oom;
@@ -2863,10 +2858,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (flags & FAULT_FLAG_WRITE) {
 		if (!(vma->vm_flags & VM_SHARED)) {
 			anon = 1;
-			if (unlikely(anon_vma_prepare(vma))) {
-				ret = VM_FAULT_OOM;
-				goto out;
-			}
 			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
 						vma, address);
 			if (!page) {
@@ -3115,6 +3106,7 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	pmd_t *pmd;
 	pte_t *pte;
 
+	WARN_ONCE(!vma->anon_vma, "No anonvma");
 	__set_current_state(TASK_RUNNING);
 
 	count_vm_event(PGFAULT);
diff --git a/mm/mmap.c b/mm/mmap.c
index 75557c6..c14284b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -463,6 +463,8 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	mm->map_count++;
 	validate_mm(mm);
+
+	anon_vma_prepare(vma);
 }
 
 /*
@@ -479,6 +481,8 @@ static void __insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
 	BUG_ON(__vma && __vma->vm_start < vma->vm_end);
 	__vma_link(mm, vma, prev, rb_link, rb_parent);
 	mm->map_count++;
+
+	anon_vma_prepare(vma);
 }
 
 static inline void
@@ -1674,12 +1678,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 	if (!(vma->vm_flags & VM_GROWSUP))
 		return -EFAULT;
 
-	/*
-	 * We must make sure the anon_vma is allocated
-	 * so that the anon_vma locking is not a noop.
-	 */
-	if (unlikely(anon_vma_prepare(vma)))
-		return -ENOMEM;
 	anon_vma_lock(vma);
 
 	/*
@@ -1720,13 +1718,6 @@ static int expand_downwards(struct vm_area_struct *vma,
 {
 	int error;
 
-	/*
-	 * We must make sure the anon_vma is allocated
-	 * so that the anon_vma locking is not a noop.
-	 */
-	if (unlikely(anon_vma_prepare(vma)))
-		return -ENOMEM;
-
 	address &= PAGE_MASK;
 	error = security_file_mmap(NULL, 0, 0, 0, address, 1);
 	if (error)

^ permalink raw reply related	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-07 21:19                                                   ` Linus Torvalds
@ 2010-04-07 21:52                                                     ` Rik van Riel
  2010-04-07 22:09                                                       ` Linus Torvalds
  0 siblings, 1 reply; 242+ messages in thread
From: Rik van Riel @ 2010-04-07 21:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Borislav Petkov, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

On 04/07/2010 05:19 PM, Linus Torvalds wrote:

> Comments?

I remember there being an "unfixable" spot with this
approach when I originally wrote the new anon_vma
linking code.

However, I can't for the life of me find that spot.
I am starting to believe I made it fixable as a side
effect of one of the changes I made :)

One of the issues with your patch is that anon_vma_prepare
can fail and this patch ignores its return value.

Having anon_vma_prepare fail after an mremap or mprotect
might result in messing up the VMAs of a process, or having
to undo the VMA changes that were made.

In fact, this may be the problem I was running into - not
wanting to add even more complex error paths to the vma
shuffling code.

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-07 21:52                                                     ` Rik van Riel
@ 2010-04-07 22:09                                                       ` Linus Torvalds
  2010-04-07 22:15                                                         ` Linus Torvalds
  2010-04-07 23:37                                                         ` Linus Torvalds
  0 siblings, 2 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-07 22:09 UTC (permalink / raw)
  To: Rik van Riel
  Cc: KOSAKI Motohiro, Borislav Petkov, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson, hannes



On Wed, 7 Apr 2010, Rik van Riel wrote:
> 
> One of the issues with your patch is that anon_vma_prepare
> can fail and this patch ignores its return value.

Yes. The failure point is too late to do anything really interesting with, 
and the old code also just causes a SIGBUS. My intention was to change the 

	WARN_ONCE(!vma->anon_vma);

into returning that SIGBUS - which is not wonderful, but is no different 
from old failures.

In the long run, it would be nicer to actually return an error from the 
mmap() that fails, but that's more complicated, and as mentioned, it's not 
what the old code used to do either (since the failure point was always at 
the page fault stage).

> Having anon_vma_prepare fail after an mremap or mprotect
> might result in messing up the VMAs of a process, or having
> to undo the VMA changes that were made.

We really aren't any worse off than we have always been.

If anon_vma_prepare() fails, the vma list will be valid, but no new pages 
can be added to that vma. That used to be true before too.

			Linus

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-07 22:09                                                       ` Linus Torvalds
@ 2010-04-07 22:15                                                         ` Linus Torvalds
  2010-04-08  0:38                                                           ` Rik van Riel
  2010-04-07 23:37                                                         ` Linus Torvalds
  1 sibling, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-07 22:15 UTC (permalink / raw)
  To: Rik van Riel
  Cc: KOSAKI Motohiro, Borislav Petkov, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson, hannes



On Wed, 7 Apr 2010, Linus Torvalds wrote:
> 
> In the long run, it would be nicer to actually return an error from the 
> mmap() that fails, but that's more complicated, and as mentioned, it's not 
> what the old code used to do either (since the failure point was always at 
> the page fault stage).

Put another way: I'm not proud of it, but the new code isn't any worse 
than what we used to have, and I think the new code is _fixable_.

The easiest way to do that would likely be to pre-allocate the anon_vma 
struct (and anon_vma_chain), and pass it down to anon_vma_prepare. That 
way anon_vma_prepare() itself can never fail, and all we need to do is a 
simple allocation earlier in the call-chain.
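
Roughly what that could look like, as a user-space sketch rather than
a patch. The prealloc_* types and functions are hypothetical and only
show the shape: the one step that can fail happens first, before any
state is touched, and the prepare step itself cannot fail.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

struct prealloc_anon_vma {
	int unused;
};

struct prealloc_vma {
	struct prealloc_anon_vma *anon;
};

/*
 * Cannot fail. Installs the preallocated object if the vma has none,
 * and returns whatever was not consumed so the caller can free it.
 */
static struct prealloc_anon_vma *
prealloc_prepare(struct prealloc_vma *vma, struct prealloc_anon_vma *pre)
{
	if (!vma->anon) {
		vma->anon = pre;
		return NULL;
	}
	return pre;
}

/* Stand-in for the mmap-side caller earlier in the call chain. */
static int prealloc_create_mapping(struct prealloc_vma *vma)
{
	struct prealloc_anon_vma *pre = calloc(1, sizeof(*pre));

	/* The only failure point, before anything has been modified. */
	if (!pre)
		return -ENOMEM;

	/* ... vma linking would happen here; nothing below can fail ... */
	free(prealloc_prepare(vma, pre));
	return 0;
}
```

The cost is an allocation that is sometimes thrown away, in exchange
for an error that can be reported from the mapping call instead of
from a later page fault.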

			Linus

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-07 22:09                                                       ` Linus Torvalds
  2010-04-07 22:15                                                         ` Linus Torvalds
@ 2010-04-07 23:37                                                         ` Linus Torvalds
  2010-04-08  2:03                                                           ` KOSAKI Motohiro
  1 sibling, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-07 23:37 UTC (permalink / raw)
  To: Rik van Riel
  Cc: KOSAKI Motohiro, Borislav Petkov, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson, hannes



On Wed, 7 Apr 2010, Linus Torvalds wrote:
> 
> Yes. The failure point is too late to do anything really interesting with, 
> and the old code also just causes a SIGBUS. My intention was to change the 
> 
> 	WARN_ONCE(!vma->anon_vma);
> 
> into returning that SIGBUS - which is not wonderful, but is no different 
> from old failures.

Not SIGBUS, but VM_FAULT_OOM, of course.

IOW, something like this should be no worse than what we have now, and has 
the much nicer locking semantics.

Having done some more digging, I can point to a downside: we do end up 
having about twice as many anon_vma entries. It seems about half of the 
vma's never need an anon_vma entry, probably because they end up being 
read-only file mappings, and thus never trigger the anonvma case.

That said:

 - I don't really think you can fix the locking problem you have in a 
   saner way

 - the anon_vma entry is much smaller than the vm_area_struct, so we're 
   still using much less memory for them than for vma's.

 - We _could_ avoid allocating anonvma entries for shared mappings or for 
   mappings that are read-only. That might force us to allocate some of 
   them at mprotect time, and/or when doing a forced COW event with 
   ptrace, but we have the mmap_sem for writing for the one case, and we 
   could decide to get it for the other.

   So it's not a _fundamental_ problem if we decide we want to recover 
   most of the memory lost by doing unconditional allocations.

There are alternative models. For example, the VM layer _could_ decide to 
just release the mmap_sem, and re-do it and take it for writing if the vma 
doesn't have an anon_vma. 

I dunno. I like how this patch makes things so much less subtle, though.  
For example: with this in place, we could further simplify 
anon_vma_prepare(), since it would now never have the re-entrancy issue 
and wouldn't need to worry about taking that page_table_lock and 
re-testing vma->anon_vma for races.

		Linus

---
 mm/memory.c |   12 +++---------
 mm/mmap.c   |   17 ++++-------------
 2 files changed, 7 insertions(+), 22 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 833952d..b5efe76 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2223,9 +2223,6 @@ reuse:
 gotten:
 	pte_unmap_unlock(page_table, ptl);
 
-	if (unlikely(anon_vma_prepare(vma)))
-		goto oom;
-
 	if (is_zero_pfn(pte_pfn(orig_pte))) {
 		new_page = alloc_zeroed_user_highpage_movable(vma, address);
 		if (!new_page)
@@ -2766,8 +2763,6 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	/* Allocate our own private page. */
 	pte_unmap(page_table);
 
-	if (unlikely(anon_vma_prepare(vma)))
-		goto oom;
 	page = alloc_zeroed_user_highpage_movable(vma, address);
 	if (!page)
 		goto oom;
@@ -2863,10 +2858,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (flags & FAULT_FLAG_WRITE) {
 		if (!(vma->vm_flags & VM_SHARED)) {
 			anon = 1;
-			if (unlikely(anon_vma_prepare(vma))) {
-				ret = VM_FAULT_OOM;
-				goto out;
-			}
 			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
 						vma, address);
 			if (!page) {
@@ -3115,6 +3106,9 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	pmd_t *pmd;
 	pte_t *pte;
 
+	if (!vma->anon_vma)
+		return VM_FAULT_OOM;
+
 	__set_current_state(TASK_RUNNING);
 
 	count_vm_event(PGFAULT);
diff --git a/mm/mmap.c b/mm/mmap.c
index 75557c6..c14284b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -463,6 +463,8 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	mm->map_count++;
 	validate_mm(mm);
+
+	anon_vma_prepare(vma);
 }
 
 /*
@@ -479,6 +481,8 @@ static void __insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
 	BUG_ON(__vma && __vma->vm_start < vma->vm_end);
 	__vma_link(mm, vma, prev, rb_link, rb_parent);
 	mm->map_count++;
+
+	anon_vma_prepare(vma);
 }
 
 static inline void
@@ -1674,12 +1678,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 	if (!(vma->vm_flags & VM_GROWSUP))
 		return -EFAULT;
 
-	/*
-	 * We must make sure the anon_vma is allocated
-	 * so that the anon_vma locking is not a noop.
-	 */
-	if (unlikely(anon_vma_prepare(vma)))
-		return -ENOMEM;
 	anon_vma_lock(vma);
 
 	/*
@@ -1720,13 +1718,6 @@ static int expand_downwards(struct vm_area_struct *vma,
 {
 	int error;
 
-	/*
-	 * We must make sure the anon_vma is allocated
-	 * so that the anon_vma locking is not a noop.
-	 */
-	if (unlikely(anon_vma_prepare(vma)))
-		return -ENOMEM;
-
 	address &= PAGE_MASK;
 	error = security_file_mmap(NULL, 0, 0, 0, address, 1);
 	if (error)


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-07 22:15                                                         ` Linus Torvalds
@ 2010-04-08  0:38                                                           ` Rik van Riel
  0 siblings, 0 replies; 242+ messages in thread
From: Rik van Riel @ 2010-04-08  0:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Borislav Petkov, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

On 04/07/2010 06:15 PM, Linus Torvalds wrote:
> On Wed, 7 Apr 2010, Linus Torvalds wrote:
>>
>> In the long run, it would be nicer to actually return an error from the
>> mmap() that fails, but that's more complicated, and as mentioned, it's not
>> what the old code used to do either (since the failure point was always at
>> the page fault stage).
>
> Put another way: I'm not proud of it, but the new code isn't any worse
> than what we used to have, and I think the new code is _fixable_.

Agreed, it is no worse than what we had before.

As to fixable, I suspect both situations are fixable.
The new code by getting the error paths right, the
old code by completely bailing out of the page fault
and retrying it (the pageout code should trigger an
OOM kill at some point, if we are really out of memory).

> The easiest way to do that would likely be to pre-allocate the anon_vma
> struct (and anon_vma_chain), and pass it down to anon_vma_prepare. That
> way anon_vma_prepare() itself can never fail, and all we need to do is a
> simple allocation earlier in the call-chain.

That may not work, because we may want to merge the anon_vma
with the anon_vma in an adjacent VMA ... and that adjacent
VMA could be chained onto multiple anon_vmas.

That means allocating a single anon_vma_chain may not be
enough.


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-07 23:37                                                         ` Linus Torvalds
@ 2010-04-08  2:03                                                           ` KOSAKI Motohiro
  2010-04-08  2:33                                                             ` Linus Torvalds
  0 siblings, 1 reply; 242+ messages in thread
From: KOSAKI Motohiro @ 2010-04-08  2:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kosaki.motohiro, Rik van Riel, Borislav Petkov, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

Hi

Wow, your patch is very cool. I'm surprised that such a 20-line patch
simplifies so much.

> On Wed, 7 Apr 2010, Linus Torvalds wrote:
> > 
> > Yes. The failure point is too late to do anything really interesting with, 
> > and the old code also just causes a SIGBUS. My intention was to change the 
> > 
> > 	WARN_ONCE(!vma->anon_vma);
> > 
> > into returning that SIGBUS - which is not wonderful, but is no different 
> > from old failures.
> 
> Not SIGBUS, but VM_FAULT_OOM, of course.

Page faults no longer insert an anon_vma now, right? If so, SIGBUS is better.
SIGBUS and VM_FAULT_OOM now lead to different results:

SIGBUS       -> kill current task
VM_FAULT_OOM -> invoke oom-killer (see pagefault_out_of_memory())

If the current task can't recover a proper anon_vma, we should just kill the
current task instead of a random process with the highest badness. Otherwise
the !anon_vma process will keep invoking the oom-killer at random.

Perhaps, I'm missing something.





* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-08  2:03                                                           ` KOSAKI Motohiro
@ 2010-04-08  2:33                                                             ` Linus Torvalds
  2010-04-08  5:47                                                               ` Borislav Petkov
  0 siblings, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-08  2:33 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Rik van Riel, Borislav Petkov, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson, hannes



On Thu, 8 Apr 2010, KOSAKI Motohiro wrote:
> 
> Page faults no longer insert an anon_vma now, right? If so, SIGBUS is better.
> SIGBUS and VM_FAULT_OOM now lead to different results:
> 
> SIGBUS       -> kill current task
> VM_FAULT_OOM -> invoke oom-killer (see pagefault_out_of_memory())

Yeah, maybe VM_FAULT_SIGBUS works ok instead of VM_FAULT_OOM. But the 
cause of it is the system having been oom when the mapping was created, so 
I think either is fine.

> If the current task can't recover a proper anon_vma, we should just kill the
> current task instead of a random process with the highest badness. Otherwise
> the !anon_vma process will keep invoking the oom-killer at random.

Yes, that is a good point.

Anyway, I think it might be interesting to test my anon_vma_prepare() 
locking change patch together with Rik's _first_ version of his "fix 
anon_vma_prepare" thing (the one without the spinlock). They should apply 
independently of each other, and maybe it all even works together.

		Linus


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-08  2:33                                                             ` Linus Torvalds
@ 2010-04-08  5:47                                                               ` Borislav Petkov
  2010-04-08 14:11                                                                 ` Linus Torvalds
  0 siblings, 1 reply; 242+ messages in thread
From: Borislav Petkov @ 2010-04-08  5:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Wed, Apr 07, 2010 at 07:33:01PM -0700

> Anyway, I think it might be interesting to test my anon_vma_prepare() 
> locking change patch together with Rik's _first_ version of his "fix 
> anon_vma_prepare" thing (the one without the spinlock). They should apply 
> independently of each other, and maybe it all even works together.

There are still issues: vma_adjust() grabs mapping->i_mmap_lock for file
mappings while we might sleep in anon_vma_prepare():

[    9.386929] BUG: sleeping function called from invalid context at mm/rmap.c:119
[    9.387188] in_atomic(): 1, irqs_disabled(): 0, pid: 1068, name: modprobe
[    9.387343] 3 locks held by modprobe/1068:
[    9.387524]  #0:  (&p->cred_guard_mutex){+.+.+.}, at: [<ffffffff810d97fc>] prepare_bprm_creds+0x29/0x5a
[    9.387959]  #1:  (&mm->mmap_sem){++++++}, at: [<ffffffff81110ee2>] elf_map+0x70/0x190
[    9.388416]  #2:  (&(&inode->i_data.i_mmap_lock)->rlock){+.+...}, at: [<ffffffff810bcbdf>] vma_adjust+0x190/0x3ca
[    9.388848] Pid: 1068, comm: modprobe Not tainted 2.6.34-rc3-00290-ge4b2849 #6
[    9.389102] Call Trace:
[    9.389256]  [<ffffffff810630f6>] ? __debug_show_held_locks+0x22/0x24
[    9.389418]  [<ffffffff8102c288>] __might_sleep+0x117/0x11b
[    9.389570]  [<ffffffff810c0f2e>] anon_vma_prepare+0x30/0x132
[    9.389722]  [<ffffffff810bcd95>] vma_adjust+0x346/0x3ca
[    9.389874]  [<ffffffff810bcf68>] __split_vma+0x14f/0x1b9
[    9.390027]  [<ffffffff810bd143>] do_munmap+0x171/0x315
[    9.390181]  [<ffffffff81110ee2>] ? elf_map+0x70/0x190
[    9.390335]  [<ffffffff81110f9d>] elf_map+0x12b/0x190
[    9.390493]  [<ffffffff81111b35>] load_elf_binary+0xb33/0x170e
[    9.390645]  [<ffffffff8102d529>] ? sub_preempt_count+0xa3/0xb6
[    9.390800]  [<ffffffff810d945a>] search_binary_handler+0x166/0x30e
[    9.390952]  [<ffffffff810d92ab>] ? copy_strings+0x1d4/0x1e5
[    9.391111]  [<ffffffff81111002>] ? load_elf_binary+0x0/0x170e
[    9.391265]  [<ffffffff810dadff>] do_execve+0x1fc/0x2f5
[    9.391424]  [<ffffffff8100a379>] sys_execve+0x43/0x61
[    9.391576]  [<ffffffff810025fa>] stub_execve+0x6a/0xc0


-- 
Regards/Gruss,
Boris.


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-08  5:47                                                               ` Borislav Petkov
@ 2010-04-08 14:11                                                                 ` Linus Torvalds
  2010-04-08 18:25                                                                   ` Rik van Riel
  2010-04-08 21:00                                                                   ` Borislav Petkov
  0 siblings, 2 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-08 14:11 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson, hannes



On Thu, 8 Apr 2010, Borislav Petkov wrote:
> 
> There are still issues: vma_adjust() grabs mapping->i_mmap_lock for file
> mappings while we might sleep in anon_vma_prepare():

Ahh. Good catch. So I can't actually do that anon_vma_prepare() thing in 
__insert_vm_struct.

It should be simple enough to just move it into the caller, just after it 
releases that lock. There's only one user of that __insert_vm_struct() 
anyway. You can do it yourself, or you can replace my previous patch with 
this..

[ The patch below also makes it warn once and return SIGBUS for the case 
  where there is no anon_vma.  I decided I still want to hear about it if 
  there might be some path that tries to insert a vma on its own ]

		Linus

---
 mm/memory.c |   12 +++---------
 mm/mmap.c   |   17 ++++-------------
 2 files changed, 7 insertions(+), 22 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 833952d..08d4423 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2223,9 +2223,6 @@ reuse:
 gotten:
 	pte_unmap_unlock(page_table, ptl);
 
-	if (unlikely(anon_vma_prepare(vma)))
-		goto oom;
-
 	if (is_zero_pfn(pte_pfn(orig_pte))) {
 		new_page = alloc_zeroed_user_highpage_movable(vma, address);
 		if (!new_page)
@@ -2766,8 +2763,6 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	/* Allocate our own private page. */
 	pte_unmap(page_table);
 
-	if (unlikely(anon_vma_prepare(vma)))
-		goto oom;
 	page = alloc_zeroed_user_highpage_movable(vma, address);
 	if (!page)
 		goto oom;
@@ -2863,10 +2858,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (flags & FAULT_FLAG_WRITE) {
 		if (!(vma->vm_flags & VM_SHARED)) {
 			anon = 1;
-			if (unlikely(anon_vma_prepare(vma))) {
-				ret = VM_FAULT_OOM;
-				goto out;
-			}
 			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
 						vma, address);
 			if (!page) {
@@ -3115,6 +3106,9 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	pmd_t *pmd;
 	pte_t *pte;
 
+	if (WARN_ONCE(!vma->anon_vma, "Mapping with no anon_vma"))
+		return VM_FAULT_SIGBUS;
+
 	__set_current_state(TASK_RUNNING);
 
 	count_vm_event(PGFAULT);
diff --git a/mm/mmap.c b/mm/mmap.c
index 75557c6..82392c2 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -463,6 +463,8 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	mm->map_count++;
 	validate_mm(mm);
+
+	anon_vma_prepare(vma);
 }
 
 /*
@@ -628,6 +630,8 @@ again:			remove_next = 1 + (end > next->vm_end);
 	if (mapping)
 		spin_unlock(&mapping->i_mmap_lock);
 
+	anon_vma_prepare(vma);
+
 	if (remove_next) {
 		if (file) {
 			fput(file);
@@ -1674,12 +1678,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 	if (!(vma->vm_flags & VM_GROWSUP))
 		return -EFAULT;
 
-	/*
-	 * We must make sure the anon_vma is allocated
-	 * so that the anon_vma locking is not a noop.
-	 */
-	if (unlikely(anon_vma_prepare(vma)))
-		return -ENOMEM;
 	anon_vma_lock(vma);
 
 	/*
@@ -1720,13 +1718,6 @@ static int expand_downwards(struct vm_area_struct *vma,
 {
 	int error;
 
-	/*
-	 * We must make sure the anon_vma is allocated
-	 * so that the anon_vma locking is not a noop.
-	 */
-	if (unlikely(anon_vma_prepare(vma)))
-		return -ENOMEM;
-
 	address &= PAGE_MASK;
 	error = security_file_mmap(NULL, 0, 0, 0, address, 1);
 	if (error)


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-08 14:11                                                                 ` Linus Torvalds
@ 2010-04-08 18:25                                                                   ` Rik van Riel
  2010-04-08 18:32                                                                     ` Linus Torvalds
  2010-04-08 21:00                                                                   ` Borislav Petkov
  1 sibling, 1 reply; 242+ messages in thread
From: Rik van Riel @ 2010-04-08 18:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, KOSAKI Motohiro, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

On 04/08/2010 10:11 AM, Linus Torvalds wrote:
>
>
> On Thu, 8 Apr 2010, Borislav Petkov wrote:
>>
>> There are still issues: vma_adjust() grabs mapping->i_mmap_lock for file
>> mappings while we might sleep in anon_vma_prepare():
>
> Ahh. Good catch. So I can't actually do that anon_vma_prepare() thing in
> __insert_vm_struct.
>
> It should be simple enough to just move it into the caller, just after it
> releases that lock. There's only one user of that __insert_vm_struct()
> anyway. You can do it yourself, or you can replace my previous patch with
> this..
>
> [ The patch below also makes it warn once and return SIGBUS for the case
>    where there is no anon_vma.  I decided I still want to hear about it if
>    there might be some path that tries to insert a vma on its own ]

Reviewed-by: Rik van Riel <riel@redhat.com>

I haven't seen any places that insert VMAs by itself.
Several strange places that allocate them, but they
all appear to use the standard functions to insert them.


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-08 18:25                                                                   ` Rik van Riel
@ 2010-04-08 18:32                                                                     ` Linus Torvalds
  2010-04-08 20:31                                                                       ` Borislav Petkov
  0 siblings, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-08 18:32 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Borislav Petkov, KOSAKI Motohiro, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson, hannes



On Thu, 8 Apr 2010, Rik van Riel wrote:
> 
> Reviewed-by: Rik van Riel <riel@redhat.com>

Yeah, I think I'll commit it as-is, assuming we get confirmation that it 
(along with your patch) actually ends up fixing the original problem.

I had actually had lockdep etc on with that patch, but for some reason I'd 
overlooked the SPINLOCK_SLEEP debugging, so I hadn't seen the stupid issue 
that Borislav pointed out. I wonder if LOCKDEP or spinlock debugging should 
just select it. Small detail, but I should have caught that obvious bug 
myself.

> I haven't seen any places that insert VMAs by itself.
> Several strange places that allocate them, but they
> all appear to use the standard functions to insert them.

Yeah, it's complicated enough to add a vma with all the rbtree etc stuff 
that I hope nobody actually cooks their own. But I too grepped for vma 
allocations, and there were more of them than I expected, so...

		Linus


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-08 18:32                                                                     ` Linus Torvalds
@ 2010-04-08 20:31                                                                       ` Borislav Petkov
  0 siblings, 0 replies; 242+ messages in thread
From: Borislav Petkov @ 2010-04-08 20:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, KOSAKI Motohiro, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Thu, Apr 08, 2010 at 11:32:06AM -0700

Here we go, another night of testing starts... got more caffeine this
time :)

> > I haven't seen any places that insert VMAs by itself.
> > Several strange places that allocate them, but they
> > all appear to use the standard functions to insert them.
> 
> Yeah, it's complicated enough to add a vma with all the rbtree etc stuff 
> that I hope nobody actually cooks their own. But I too grepped for vma 
> allocations, and there were more of them than I expected, so...

... and of course, I just hit that WARN_ONCE on the first suspend (it did
suspend ok though):

[   88.078958] ------------[ cut here ]------------
[   88.079007] WARNING: at mm/memory.c:3110 handle_mm_fault+0x56/0x67c()
[   88.079032] Hardware name: System Product Name
[   88.079056] Mapping with no anon_vma
[   88.079082] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod k10temp 8250_pnp 8250 serial_core edac_core ohci_hcd pcspkr
[   88.079637] Pid: 1965, comm: console-kit-dae Not tainted 2.6.34-rc3-00290-g2156db9 #7
[   88.079676] Call Trace:
[   88.079713]  [<ffffffff81037ea8>] warn_slowpath_common+0x7c/0x94
[   88.079744]  [<ffffffff81037f17>] warn_slowpath_fmt+0x41/0x43
[   88.079774]  [<ffffffff810b857d>] handle_mm_fault+0x56/0x67c
[   88.079805]  [<ffffffff8101f392>] do_page_fault+0x30b/0x32d
[   88.079838]  [<ffffffff810615ce>] ? put_lock_stats+0xe/0x27
[   88.079866]  [<ffffffff81062a55>] ? lock_release_holdtime+0x104/0x109
[   88.079898]  [<ffffffff813f93e3>] ? error_sti+0x5/0x6
[   88.079929]  [<ffffffff813f7de2>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[   88.079960]  [<ffffffff813f91ff>] page_fault+0x1f/0x30
[   88.079988] ---[ end trace 154dd7f6249e1cc3 ]---

and then sysfs triggered that lockdep circular locking warning - I
thought it was fixed already :(


[  256.831204] =======================================================
[  256.831210] [ INFO: possible circular locking dependency detected ]
[  256.831216] 2.6.34-rc3-00290-g2156db9 #7
[  256.831221] -------------------------------------------------------
[  256.831226] hib.sh/2464 is trying to acquire lock:
[  256.831231]  (s_active#80){++++.+}, at: [<ffffffff81127412>] sysfs_addrm_finish+0x36/0x5f
[  256.831250] 
[  256.831252] but task is already holding lock:
[  256.831256]  (&per_cpu(cpu_policy_rwsem, cpu)){+++++.}, at: [<ffffffff8131bb52>] lock_policy_rwsem_write+0x4f/0x80
[  256.831271] 
[  256.831273] which lock already depends on the new lock.
[  256.831275] 
[  256.831278] 
[  256.831280] the existing dependency chain (in reverse order) is:
[  256.831284] 
[  256.831286] -> #1 (&per_cpu(cpu_policy_rwsem, cpu)){+++++.}:
[  256.831294]        [<ffffffff8106790a>] __lock_acquire+0x1306/0x169f
[  256.831305]        [<ffffffff81067d95>] lock_acquire+0xf2/0x118
[  256.831314]        [<ffffffff813f727a>] down_read+0x4c/0x91
[  256.831323]        [<ffffffff8131c9f3>] lock_policy_rwsem_read+0x4f/0x80
[  256.831332]        [<ffffffff8131ca5c>] show+0x38/0x71
[  256.831341]        [<ffffffff81125ef0>] sysfs_read_file+0xb9/0x13e
[  256.831348]        [<ffffffff810d5901>] vfs_read+0xaf/0x150
[  256.831357]        [<ffffffff810d5a65>] sys_read+0x4a/0x71
[  256.831364]        [<ffffffff810021db>] system_call_fastpath+0x16/0x1b
[  256.831375] 
[  256.831376] -> #0 (s_active#80){++++.+}:
[  256.831385]        [<ffffffff810675c1>] __lock_acquire+0xfbd/0x169f
[  256.831385]        [<ffffffff81067d95>] lock_acquire+0xf2/0x118
[  256.831385]        [<ffffffff81126a79>] sysfs_deactivate+0x91/0xe6
[  256.831385]        [<ffffffff81127412>] sysfs_addrm_finish+0x36/0x5f
[  256.831385]        [<ffffffff81127504>] sysfs_remove_dir+0x7a/0x8d
[  256.831385]        [<ffffffff8118522e>] kobject_del+0x16/0x37
[  256.831385]        [<ffffffff8118528d>] kobject_release+0x3e/0x66
[  256.831385]        [<ffffffff811860d9>] kref_put+0x43/0x4d
[  256.831385]        [<ffffffff811851a9>] kobject_put+0x47/0x4b
[  256.831385]        [<ffffffff8131ba68>] __cpufreq_remove_dev+0x1e5/0x241
[  256.831385]        [<ffffffff813f4e33>] cpufreq_cpu_callback+0x67/0x7f
[  256.831385]        [<ffffffff8105846b>] notifier_call_chain+0x37/0x63
[  256.831385]        [<ffffffff81058505>] __raw_notifier_call_chain+0xe/0x10
[  256.831385]        [<ffffffff813e6091>] _cpu_down+0x98/0x2a6
[  256.831385]        [<ffffffff810396b1>] disable_nonboot_cpus+0x74/0x10d
[  256.831385]        [<ffffffff81075ac9>] hibernation_snapshot+0xac/0x1e1
[  256.831385]        [<ffffffff81075ccc>] hibernate+0xce/0x172
[  256.831385]        [<ffffffff81074a39>] state_store+0x5c/0xd3
[  256.831385]        [<ffffffff81184fb7>] kobj_attr_store+0x17/0x19
[  256.831385]        [<ffffffff81125dfb>] sysfs_write_file+0x108/0x144
[  256.831385]        [<ffffffff810d56c7>] vfs_write+0xb2/0x153
[  256.831385]        [<ffffffff810d582b>] sys_write+0x4a/0x71
[  256.831385]        [<ffffffff810021db>] system_call_fastpath+0x16/0x1b
[  256.831385] 
[  256.831385] other info that might help us debug this:
[  256.831385] 
[  256.831385] 6 locks held by hib.sh/2464:
[  256.831385]  #0:  (&buffer->mutex){+.+.+.}, at: [<ffffffff81125d2f>] sysfs_write_file+0x3c/0x144
[  256.831385]  #1:  (s_active#49){.+.+.+}, at: [<ffffffff81125dda>] sysfs_write_file+0xe7/0x144
[  256.831385]  #2:  (pm_mutex){+.+.+.}, at: [<ffffffff81075c1a>] hibernate+0x1c/0x172
[  256.831385]  #3:  (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff810395d1>] cpu_maps_update_begin+0x17/0x19
[  256.831385]  #4:  (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff81039616>] cpu_hotplug_begin+0x2c/0x53
[  256.831385]  #5:  (&per_cpu(cpu_policy_rwsem, cpu)){+++++.}, at: [<ffffffff8131bb52>] lock_policy_rwsem_write+0x4f/0x80
[  256.831385] 
[  256.831385] stack backtrace:
[  256.831385] Pid: 2464, comm: hib.sh Tainted: G        W  2.6.34-rc3-00290-g2156db9 #7
[  256.831385] Call Trace:
[  256.831385]  [<ffffffff810643c3>] print_circular_bug+0xae/0xbd
[  256.831385]  [<ffffffff810675c1>] __lock_acquire+0xfbd/0x169f
[  256.831385]  [<ffffffff81127412>] ? sysfs_addrm_finish+0x36/0x5f
[  256.831385]  [<ffffffff81067d95>] lock_acquire+0xf2/0x118
[  256.831385]  [<ffffffff81127412>] ? sysfs_addrm_finish+0x36/0x5f
[  256.831385]  [<ffffffff81126a79>] sysfs_deactivate+0x91/0xe6
[  256.831385]  [<ffffffff81127412>] ? sysfs_addrm_finish+0x36/0x5f
[  256.831385]  [<ffffffff81063d12>] ? trace_hardirqs_on+0xd/0xf
[  256.831385]  [<ffffffff81126f3d>] ? release_sysfs_dirent+0x89/0xa9
[  256.831385]  [<ffffffff81127412>] sysfs_addrm_finish+0x36/0x5f
[  256.831385]  [<ffffffff81127504>] sysfs_remove_dir+0x7a/0x8d
[  256.831385]  [<ffffffff8118522e>] kobject_del+0x16/0x37
[  256.831385]  [<ffffffff8118528d>] kobject_release+0x3e/0x66
[  256.831385]  [<ffffffff8118524f>] ? kobject_release+0x0/0x66
[  256.831385]  [<ffffffff811860d9>] kref_put+0x43/0x4d
[  256.831385]  [<ffffffff811851a9>] kobject_put+0x47/0x4b
[  256.831385]  [<ffffffff8131ba68>] __cpufreq_remove_dev+0x1e5/0x241
[  256.831385]  [<ffffffff813f4e33>] cpufreq_cpu_callback+0x67/0x7f
[  256.831385]  [<ffffffff8105846b>] notifier_call_chain+0x37/0x63
[  256.831385]  [<ffffffff81058505>] __raw_notifier_call_chain+0xe/0x10
[  256.831385]  [<ffffffff813e6091>] _cpu_down+0x98/0x2a6
[  256.831385]  [<ffffffff810396b1>] disable_nonboot_cpus+0x74/0x10d
[  256.831385]  [<ffffffff81075ac9>] hibernation_snapshot+0xac/0x1e1
[  256.831385]  [<ffffffff81075ccc>] hibernate+0xce/0x172
[  256.831385]  [<ffffffff81074a39>] state_store+0x5c/0xd3
[  256.831385]  [<ffffffff81184fb7>] kobj_attr_store+0x17/0x19
[  256.831385]  [<ffffffff81125dfb>] sysfs_write_file+0x108/0x144
[  256.831385]  [<ffffffff810d56c7>] vfs_write+0xb2/0x153
[  256.831385]  [<ffffffff81063cda>] ? trace_hardirqs_on_caller+0x120/0x14b
[  256.831385]  [<ffffffff810d582b>] sys_write+0x4a/0x71
[  256.831385]  [<ffffffff810021db>] system_call_fastpath+0x16/0x1b

-- 
Regards/Gruss,
Boris.


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-08 14:11                                                                 ` Linus Torvalds
  2010-04-08 18:25                                                                   ` Rik van Riel
@ 2010-04-08 21:00                                                                   ` Borislav Petkov
  2010-04-08 23:16                                                                     ` Linus Torvalds
  1 sibling, 1 reply; 242+ messages in thread
From: Borislav Petkov @ 2010-04-08 21:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Thu, Apr 08, 2010 at 07:11:11AM -0700

> [ The patch below also makes it warn once and return SIGBUS for the case 
>   where there is no anon_vma.  I decided I still want to hear about it if 
>   there might be some path that tries to insert a vma on its own ]

And this happens quite often - I changed the WARN_ONCE to WARN and can't
start kvm, iceowl (mozilla calendar) and the console-kit-daemon craps up
upon boot too:

[   55.814570] ------------[ cut here ]------------
[   55.814623] WARNING: at mm/memory.c:3110 handle_mm_fault+0x43/0x66a()
[   55.814648] Hardware name: System Product Name
[   55.814671] Mapping with no anon_vma
[   55.814693] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 edac_core ohci_hcd serial_core k10temp pcspkr
[   55.815249] Pid: 1936, comm: console-kit-dae Not tainted 2.6.34-rc3-00290-g2156db9-dirty #8
[   55.815290] Call Trace:
[   55.815327]  [<ffffffff81037ea8>] warn_slowpath_common+0x7c/0x94
[   55.815362]  [<ffffffff81037f17>] warn_slowpath_fmt+0x41/0x43
[   55.815391]  [<ffffffff810b856a>] handle_mm_fault+0x43/0x66a
[   55.815420]  [<ffffffff8101f392>] do_page_fault+0x30b/0x32d
[   55.815452]  [<ffffffff810615ce>] ? put_lock_stats+0xe/0x27
[   55.815483]  [<ffffffff81062a55>] ? lock_release_holdtime+0x104/0x109
[   55.815518]  [<ffffffff813f93e3>] ? error_sti+0x5/0x6
[   55.815553]  [<ffffffff813f7dd2>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[   55.815585]  [<ffffffff813f91ff>] page_fault+0x1f/0x30
[   55.815613] ---[ end trace fa59f67cbfeeca44 ]---
[   60.801651] ------------[ cut here ]------------
[   60.801672] WARNING: at mm/memory.c:3110 handle_mm_fault+0x43/0x66a()
[   60.801681] Hardware name: System Product Name
[   60.801689] Mapping with no anon_vma
[   60.801702] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 edac_core ohci_hcd serial_core k10temp pcspkr
[   60.802156] Pid: 2008, comm: iceowl-bin Tainted: G        W  2.6.34-rc3-00290-g2156db9-dirty #8
[   60.802169] Call Trace:
[   60.802181]  [<ffffffff81037ea8>] warn_slowpath_common+0x7c/0x94
[   60.802191]  [<ffffffff81037f17>] warn_slowpath_fmt+0x41/0x43
[   60.802203]  [<ffffffff810b856a>] handle_mm_fault+0x43/0x66a
[   60.802213]  [<ffffffff8101f392>] do_page_fault+0x30b/0x32d
[   60.802225]  [<ffffffff810615ce>] ? put_lock_stats+0xe/0x27
[   60.802235]  [<ffffffff81062a55>] ? lock_release_holdtime+0x104/0x109
[   60.802268]  [<ffffffff813f93e3>] ? error_sti+0x5/0x6
[   60.802279]  [<ffffffff813f7dd2>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[   60.802290]  [<ffffffff813f91ff>] page_fault+0x1f/0x30
[   60.802305] ---[ end trace fa59f67cbfeeca45 ]---
[   92.123350] ------------[ cut here ]------------
[   92.123402] WARNING: at kernel/sched.c:3555 add_preempt_count+0x9c/0xcb()
[   92.123428] Hardware name: System Product Name
[   92.123451] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 edac_core ohci_hcd serial_core k10temp pcspkr
[   92.123902] Pid: 2111, comm: kvm Tainted: G        W  2.6.34-rc3-00290-g2156db9-dirty #8
[   92.123940] Call Trace:
[   92.123973]  [<ffffffff81037ea8>] warn_slowpath_common+0x7c/0x94
[   92.124002]  [<ffffffff81037ed4>] warn_slowpath_null+0x14/0x16
[   92.124031]  [<ffffffff8102d5d8>] add_preempt_count+0x9c/0xcb
[   92.124061]  [<ffffffff813f7ee9>] _raw_spin_lock_nest_lock+0x21/0x7a
[   92.124090]  [<ffffffff810bc079>] ? mm_take_all_locks+0xf9/0x150
[   92.124118]  [<ffffffff810bc079>] mm_take_all_locks+0xf9/0x150
[   92.124146]  [<ffffffff810cc48d>] ? do_mmu_notifier_register+0xd3/0x19d
[   92.124174]  [<ffffffff810cc495>] do_mmu_notifier_register+0xdb/0x19d
[   92.124202]  [<ffffffff810cc57c>] mmu_notifier_register+0x13/0x15
[   92.124256]  [<ffffffffa00c67e3>] kvm_dev_ioctl+0x2c8/0x495 [kvm]
[   92.124318]  [<ffffffff810e24ff>] vfs_ioctl+0x32/0xa6
[   92.124357]  [<ffffffff810e2a91>] do_vfs_ioctl+0x495/0x4db
[   92.124390]  [<ffffffff813f93e3>] ? error_sti+0x5/0x6
[   92.124425]  [<ffffffff813f8fad>] ? retint_swapgs+0xe/0x13
[   92.124458]  [<ffffffff810e2b1e>] sys_ioctl+0x47/0x6a
[   92.124498]  [<ffffffff810021db>] system_call_fastpath+0x16/0x1b
[   92.124527] ---[ end trace fa59f67cbfeeca46 ]---
[   92.213834] ------------[ cut here ]------------
[   92.213888] WARNING: at mm/memory.c:3110 handle_mm_fault+0x43/0x66a()
[   92.213913] Hardware name: System Product Name
[   92.213937] Mapping with no anon_vma
[   92.213959] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 edac_core ohci_hcd serial_core k10temp pcspkr
[   92.214529] Pid: 2111, comm: kvm Tainted: G        W  2.6.34-rc3-00290-g2156db9-dirty #8
[   92.214571] Call Trace:
[   92.214612]  [<ffffffff81037ea8>] warn_slowpath_common+0x7c/0x94
[   92.214647]  [<ffffffff81037f17>] warn_slowpath_fmt+0x41/0x43
[   92.214683]  [<ffffffff810b856a>] handle_mm_fault+0x43/0x66a
[   92.214718]  [<ffffffff8101f392>] do_page_fault+0x30b/0x32d
[   92.214751]  [<ffffffff810be3ab>] ? do_mmap_pgoff+0x290/0x2f3
[   92.214787]  [<ffffffff813f93e3>] ? error_sti+0x5/0x6
[   92.214821]  [<ffffffff81062b97>] ? trace_hardirqs_off_caller+0x1f/0xa9
[   92.214857]  [<ffffffff813f7dd2>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[   92.214896]  [<ffffffff813f91ff>] page_fault+0x1f/0x30
[   92.214928] ---[ end trace fa59f67cbfeeca47 ]---

-- 
Regards/Gruss,
Boris.


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-08 21:00                                                                   ` Borislav Petkov
@ 2010-04-08 23:16                                                                     ` Linus Torvalds
  2010-04-08 23:47                                                                       ` Borislav Petkov
  0 siblings, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-08 23:16 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson, hannes



On Thu, 8 Apr 2010, Borislav Petkov wrote:
> 
> And this happens quite often - I changed the WARN_ONCE to WARN and can't
> start kvm, iceowl (mozilla calendar) and the console-kit-daemon craps up
> upon boot too:

Hmm. I tried console-kit-daemon, which I had installed, but didn't get 
anything like that. Probably some setup difference.

I also went through every user of 'vm_area_cachep', and saw nothing 
suspicious at least for the mmu case (I didn't check the nommu.c code). I 
must have missed something.

One thing you could do is to add some more debugging info when that "no 
anon_vma" warning happens. In particular, if you still have the SLUB 
debugging on, you could try to do that

	page = virt_to_head_page(vma);
	object_err(vm_area_cachep, page, (void *)vma, "NULL anon_vma");

and it should give you _which_ routine did the kmem_cache_alloc() for the 
vma that doesn't have an anon_vma.
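
[ A minimal userspace sketch of the idea, not kernel code: every name
  prefixed with toy_ is hypothetical and only models the behaviour. The
  allocator records which function asked for the object, and the fault
  path reports that site when it finds a NULL anon_vma. ]

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Debug metadata the toy allocator keeps per object (assumption: the
 * real SLUB debug tracking stores more than this). */
struct toy_track {
	const char *alloc_site;
};

struct toy_vma {
	struct toy_track track;
	void *anon_vma;		/* stays NULL until someone sets it up */
};

/* Allocate a zeroed toy vma and remember who asked for it. */
struct toy_vma *toy_vma_alloc_from(const char *site)
{
	struct toy_vma *vma = calloc(1, sizeof(*vma));

	if (vma)
		vma->track.alloc_site = site;
	return vma;
}
#define toy_vma_alloc() toy_vma_alloc_from(__func__)

/* Print a report in the spirit of object_err() and return the site. */
const char *toy_object_err(const struct toy_vma *vma, const char *reason)
{
	assert(vma != NULL);
	fprintf(stderr, "BUG toy_vma: %s\nINFO: Allocated in %s\n",
		reason, vma->track.alloc_site);
	return vma->track.alloc_site;
}

/* Stand-in for the code path that creates the mapping. */
struct toy_vma *toy_mmap_region(void)
{
	return toy_vma_alloc();
}

/* Stand-in for the fault path: complain if the anon_vma is missing. */
const char *toy_handle_fault(const struct toy_vma *vma)
{
	if (!vma->anon_vma)
		return toy_object_err(vma, "NULL anon_vma");
	return NULL;
}

/* Convenience check: was this object allocated by the named function? */
int toy_allocated_in(const struct toy_vma *vma, const char *site)
{
	return strcmp(vma->track.alloc_site, site) == 0;
}
```

[ The point is only that the allocation site travels with the object, so
  the report names the creator of the bad vma rather than the place where
  the fault happened to trip over it. ]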

		Linus


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-08 23:16                                                                     ` Linus Torvalds
@ 2010-04-08 23:47                                                                       ` Borislav Petkov
  2010-04-09  0:50                                                                         ` Linus Torvalds
  0 siblings, 1 reply; 242+ messages in thread
From: Borislav Petkov @ 2010-04-08 23:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Thu, Apr 08, 2010 at 04:16:23PM -0700

> > And this happens quite often - I changed the WARN_ONCE to WARN and can't
> > start kvm, iceowl (mozilla calendar) and the console-kit-daemon craps up
> > upon boot too:
> 
> Hmm. I tried console-kit-daemon, which I had installed, but didn't get 
> anything like that. Probably some setup difference.
> 
> I also went through every user of 'vm_area_cachep', and saw nothing 
> suspicious at least for the mmu case (I didn't check the nommu.c code). I 
> must have missed something.
> 
> One thing you could do is to add some more debugging info when that "no 
> anon_vma" warning happens. In particular, if you still have the SLUB 
> debugging on, you could try to do that
> 
> 	page = virt_to_head_page(vma);
> 	object_err(vm_area_cachep, page, (void *)vma, "NULL anon_vma");
> 
> and it should give you _which_ routine did the kmem_cache_alloc() for the 
> vma that doesn't have an anon_vma.

Yep, looks good: it's mmap_region()...


[   88.237326] ------------[ cut here ]------------
[   88.237377] WARNING: at mm/memory.c:3110 handle_mm_fault+0x43/0x6ab()
[   88.237403] Hardware name: System Product Name
[   88.237428] Mapping with no anon_vma
[   88.237451] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 ohci_hcd edac_core serial_core pcspkr k10temp
[   88.237938] Pid: 1978, comm: console-kit-dae Not tainted 2.6.34-rc3-00290-g2156db9-dirty #9
[   88.237980] Call Trace:
[   88.239269]  [<ffffffff81037ec0>] warn_slowpath_common+0x7c/0x94
[   88.239320]  [<ffffffff81037f2f>] warn_slowpath_fmt+0x41/0x43
[   88.239378]  [<ffffffff810b8582>] handle_mm_fault+0x43/0x6ab
[   88.239440]  [<ffffffff8101f3b2>] do_page_fault+0x30b/0x32d
[   88.239471]  [<ffffffff810615e6>] ? put_lock_stats+0xe/0x27
[   88.239517]  [<ffffffff81062a6d>] ? lock_release_holdtime+0x104/0x109
[   88.239548]  [<ffffffff813f9463>] ? error_sti+0x5/0x6
[   88.239597]  [<ffffffff813f7e52>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[   88.239626]  [<ffffffff813f927f>] page_fault+0x1f/0x30
[   88.239674] ---[ end trace 42d53170a0d3ccef ]---
[   88.239699] =============================================================================
[   88.239750] BUG vm_area_struct: NULL anon_vma
[   88.239790] -----------------------------------------------------------------------------
[   88.239794] 
[   88.239805] INFO: Allocated in mmap_region+0x23d/0x500 age=2 cpu=0 pid=1978
[   88.239815] INFO: Slab 0xffffea0007a0f0e8 objects=17 used=1 fp=0xffff88022dfbb0f0 flags=0x80000000000000c2
[   88.239823] INFO: Object 0xffff88022dfbb000 @offset=0 fp=0xffff88022dfbb0f0
[   88.239827] 
[   88.239832]   Object 0xffff88022dfbb000:  00 32 53 2b 02 88 ff ff 00 20 ab 29 d1 7f 00 00 .2S+..ÿÿ..«)Ñ...
[   88.239861]   Object 0xffff88022dfbb010:  00 30 ac 29 d1 7f 00 00 e0 81 2b 2c 02 88 ff ff .0¬)Ñ...à.+,..ÿÿ
[   88.239886]   Object 0xffff88022dfbb020:  25 00 00 00 00 00 00 80 73 00 10 00 00 00 00 00 %.......s.......
[   88.239910]   Object 0xffff88022dfbb030:  10 82 2b 2c 02 88 ff ff 00 00 00 00 00 00 00 00 ..+,..ÿÿ........
[   88.239966]   Object 0xffff88022dfbb040:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[   88.240016]   Object 0xffff88022dfbb050:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[   88.240077]   Object 0xffff88022dfbb060:  00 00 00 00 00 00 00 00 10 a0 1c 2c 02 88 ff ff ...........,..ÿÿ
[   88.240160]   Object 0xffff88022dfbb070:  10 a0 1c 2c 02 88 ff ff 00 00 00 00 00 00 00 00 ...,..ÿÿ........
[   88.240225]   Object 0xffff88022dfbb080:  00 00 00 00 00 00 00 00 b2 9a 12 fd 07 00 00 00 ........²..ý....
[   88.240294]   Object 0xffff88022dfbb090:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[   88.240352]   Object 0xffff88022dfbb0a0:  00 00 00 00 00 00 00 00                         ........        
[   88.240442]  Redzone 0xffff88022dfbb0a8:  cc cc cc cc cc cc cc cc                         ÌÌÌÌÌÌÌÌ        
[   88.240509]  Padding 0xffff88022dfbb0e8:  5a 5a 5a 5a 5a 5a 5a 5a                         ZZZZZZZZ        
[   88.240567] Pid: 1978, comm: console-kit-dae Tainted: G        W  2.6.34-rc3-00290-g2156db9-dirty #9
[   88.240578] Call Trace:
[   88.240593]  [<ffffffff810cd802>] print_trailer+0x139/0x142
[   88.240607]  [<ffffffff810cd845>] object_err+0x3a/0x42
[   88.240617]  [<ffffffff810b85e2>] handle_mm_fault+0xa3/0x6ab
[   88.240641]  [<ffffffff8101f3b2>] do_page_fault+0x30b/0x32d
[   88.240652]  [<ffffffff810615e6>] ? put_lock_stats+0xe/0x27
[   88.240663]  [<ffffffff81062a6d>] ? lock_release_holdtime+0x104/0x109
[   88.240685]  [<ffffffff813f9463>] ? error_sti+0x5/0x6
[   88.240695]  [<ffffffff813f7e52>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[   88.240707]  [<ffffffff813f927f>] page_fault+0x1f/0x30
[   93.841666] ------------[ cut here ]------------
[   93.841716] WARNING: at mm/memory.c:3110 handle_mm_fault+0x43/0x6ab()
[   93.841741] Hardware name: System Product Name
[   93.841766] Mapping with no anon_vma
[   93.841793] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 ohci_hcd edac_core serial_core pcspkr k10temp
[   93.842339] Pid: 2050, comm: iceowl-bin Tainted: G        W  2.6.34-rc3-00290-g2156db9-dirty #9
[   93.842383] Call Trace:
[   93.842424]  [<ffffffff81037ec0>] warn_slowpath_common+0x7c/0x94
[   93.842457]  [<ffffffff81037f2f>] warn_slowpath_fmt+0x41/0x43
[   93.842492]  [<ffffffff810b8582>] handle_mm_fault+0x43/0x6ab
[   93.842527]  [<ffffffff8101f3b2>] do_page_fault+0x30b/0x32d
[   93.842561]  [<ffffffff810615e6>] ? put_lock_stats+0xe/0x27
[   93.842593]  [<ffffffff81062a6d>] ? lock_release_holdtime+0x104/0x109
[   93.842627]  [<ffffffff813f9463>] ? error_sti+0x5/0x6
[   93.842660]  [<ffffffff813f7e52>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[   93.842694]  [<ffffffff813f927f>] page_fault+0x1f/0x30
[   93.842724] ---[ end trace 42d53170a0d3ccf0 ]---
[   93.842750] =============================================================================
[   93.842794] BUG vm_area_struct: NULL anon_vma
[   93.842822] -----------------------------------------------------------------------------
[   93.842827] 
[   93.842889] INFO: Allocated in mmap_region+0x23d/0x500 age=1 cpu=2 pid=2050
[   93.842918] INFO: Slab 0xffffea00079b84b8 objects=17 used=7 fp=0xffff88022c6f1690 flags=0x80000000000000c2
[   93.842961] INFO: Object 0xffff88022c6f15a0 @offset=1440 fp=0xffff88022c6f1690
[   93.842965] 
[   93.843005] Bytes b4 0xffff88022c6f1590:  48 d9 fc ff 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a HÙüÿ....ZZZZZZZZ
[   93.843466]   Object 0xffff88022c6f15a0:  00 78 b4 2e 02 88 ff ff 00 80 ce 49 5f 7f 00 00 .x´...ÿÿ..ÎI_...
[   93.843877]   Object 0xffff88022c6f15b0:  00 90 4e 4a 5f 7f 00 00 c0 13 6f 2c 02 88 ff ff ..NJ_...À.o,..ÿÿ
[   93.844391]   Object 0xffff88022c6f15c0:  25 00 00 00 00 00 00 80 73 00 10 00 00 00 00 00 %.......s.......
[   93.844794]   Object 0xffff88022c6f15d0:  e0 94 4a 2c 02 88 ff ff 00 00 00 00 00 00 00 00 à.J,..ÿÿ........
[   93.845198]   Object 0xffff88022c6f15e0:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[   93.845665]   Object 0xffff88022c6f15f0:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[   93.846076]   Object 0xffff88022c6f1600:  00 00 00 00 00 00 00 00 30 2d ec 2a 02 88 ff ff ........0-ì*..ÿÿ
[   93.846518]   Object 0xffff88022c6f1610:  30 2d ec 2a 02 88 ff ff 00 00 00 00 00 00 00 00 0-ì*..ÿÿ........
[   93.846931]   Object 0xffff88022c6f1620:  00 00 00 00 00 00 00 00 e8 9c f4 f5 07 00 00 00 ........è.ôõ....
[   93.847372]   Object 0xffff88022c6f1630:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[   93.847787]   Object 0xffff88022c6f1640:  00 00 00 00 00 00 00 00                         ........        
[   93.848194]  Redzone 0xffff88022c6f1648:  cc cc cc cc cc cc cc cc                         ÌÌÌÌÌÌÌÌ        
[   93.848635]  Padding 0xffff88022c6f1688:  5a 5a 5a 5a 5a 5a 5a 5a                         ZZZZZZZZ        
[   93.849036] Pid: 2050, comm: iceowl-bin Tainted: G        W  2.6.34-rc3-00290-g2156db9-dirty #9
[   93.849078] Call Trace:
[   93.849111]  [<ffffffff810cd802>] print_trailer+0x139/0x142
[   93.849142]  [<ffffffff810cd845>] object_err+0x3a/0x42
[   93.849174]  [<ffffffff810b85e2>] handle_mm_fault+0xa3/0x6ab
[   93.849204]  [<ffffffff8101f3b2>] do_page_fault+0x30b/0x32d
[   93.849237]  [<ffffffff810615e6>] ? put_lock_stats+0xe/0x27
[   93.849301]  [<ffffffff81062a6d>] ? lock_release_holdtime+0x104/0x109
[   93.849337]  [<ffffffff813f9463>] ? error_sti+0x5/0x6
[   93.849370]  [<ffffffff813f7e52>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[   93.849418]  [<ffffffff813f927f>] page_fault+0x1f/0x30


-- 
Regards/Gruss,
Boris.


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-08 23:47                                                                       ` Borislav Petkov
@ 2010-04-09  0:50                                                                         ` Linus Torvalds
  2010-04-09  1:30                                                                           ` Borislav Petkov
  2010-04-09  1:45                                                                           ` KOSAKI Motohiro
  0 siblings, 2 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-09  0:50 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson, hannes



On Fri, 9 Apr 2010, Borislav Petkov wrote:
> 
> Yep, looks good: it's mmap_region()...

Can you double-check your current diffs - maybe something got corrupted.

mmap_region installs the vma with vma_link(), and the last thing 
vma_link() does with my patch is that "anon_vma_prepare()".

Maybe with all the patches flying around, you had a reject or something, 
and you lost that one anon_vma_prepare()?

Or maybe I screwed up somewhere and sent you the wrong patch. Here it is 
again, just in case.

[ I have a horrible cold, and can hardly think straight. So who knows, 
  maybe I'm missing something. But if you have lost one of the 
  'anon_vma_prepare()' call sites, that would certainly explain why you 
  get NULL anon_vma's ]

		Linus

---
 mm/memory.c |   12 +++---------
 mm/mmap.c   |   17 ++++-------------
 2 files changed, 7 insertions(+), 22 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 833952d..08d4423 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2223,9 +2223,6 @@ reuse:
 gotten:
 	pte_unmap_unlock(page_table, ptl);
 
-	if (unlikely(anon_vma_prepare(vma)))
-		goto oom;
-
 	if (is_zero_pfn(pte_pfn(orig_pte))) {
 		new_page = alloc_zeroed_user_highpage_movable(vma, address);
 		if (!new_page)
@@ -2766,8 +2763,6 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	/* Allocate our own private page. */
 	pte_unmap(page_table);
 
-	if (unlikely(anon_vma_prepare(vma)))
-		goto oom;
 	page = alloc_zeroed_user_highpage_movable(vma, address);
 	if (!page)
 		goto oom;
@@ -2863,10 +2858,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (flags & FAULT_FLAG_WRITE) {
 		if (!(vma->vm_flags & VM_SHARED)) {
 			anon = 1;
-			if (unlikely(anon_vma_prepare(vma))) {
-				ret = VM_FAULT_OOM;
-				goto out;
-			}
 			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
 						vma, address);
 			if (!page) {
@@ -3115,6 +3106,9 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	pmd_t *pmd;
 	pte_t *pte;
 
+	if (WARN_ONCE(!vma->anon_vma, "Mapping with no anon_vma"))
+		return VM_FAULT_SIGBUS;
+
 	__set_current_state(TASK_RUNNING);
 
 	count_vm_event(PGFAULT);
diff --git a/mm/mmap.c b/mm/mmap.c
index 75557c6..82392c2 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -463,6 +463,8 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	mm->map_count++;
 	validate_mm(mm);
+
+	anon_vma_prepare(vma);
 }
 
 /*
@@ -628,6 +630,8 @@ again:			remove_next = 1 + (end > next->vm_end);
 	if (mapping)
 		spin_unlock(&mapping->i_mmap_lock);
 
+	anon_vma_prepare(vma);
+
 	if (remove_next) {
 		if (file) {
 			fput(file);
@@ -1674,12 +1678,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 	if (!(vma->vm_flags & VM_GROWSUP))
 		return -EFAULT;
 
-	/*
-	 * We must make sure the anon_vma is allocated
-	 * so that the anon_vma locking is not a noop.
-	 */
-	if (unlikely(anon_vma_prepare(vma)))
-		return -ENOMEM;
 	anon_vma_lock(vma);
 
 	/*
@@ -1720,13 +1718,6 @@ static int expand_downwards(struct vm_area_struct *vma,
 {
 	int error;
 
-	/*
-	 * We must make sure the anon_vma is allocated
-	 * so that the anon_vma locking is not a noop.
-	 */
-	if (unlikely(anon_vma_prepare(vma)))
-		return -ENOMEM;
-
 	address &= PAGE_MASK;
 	error = security_file_mmap(NULL, 0, 0, 0, address, 1);
 	if (error)


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-09  0:50                                                                         ` Linus Torvalds
@ 2010-04-09  1:30                                                                           ` Borislav Petkov
  2010-04-09  9:21                                                                             ` Borislav Petkov
  2010-04-09  1:45                                                                           ` KOSAKI Motohiro
  1 sibling, 1 reply; 242+ messages in thread
From: Borislav Petkov @ 2010-04-09  1:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Thu, Apr 08, 2010 at 05:50:21PM -0700

> > Yep, looks good: it's mmap_region()...
> 
> Can you double-check your current diffs - maybe something got corrupted.
> 
> mmap_region installs the vma with vma_link(), and the last thing 
> vma_link() does with my patch is that "anon_vma_prepare()".

Right, it looks like it. I'll add some more debugging calls there
tomorrow - it might give us more clues in case someone hasn't caught it
until then.

> Maybe with all the patches flying around, you had a reject or something, 
> and you lost that one anon_vma_prepare()?
> 
> Or maybe I screwed up somewhere and sent you the wrong patch. Here it is 
> again, just in case.

Doesn't look like it - here's the diff between yours and what I have
applied here (yep, only minor fuzz but no code differences). Also, I've
added my version at the end:

--- a.diff      2010-04-09 03:03:35.000000000 +0200
+++ b.diff      2010-04-09 03:03:52.000000000 +0200
@@ -1,8 +1,8 @@
 diff --git a/mm/memory.c b/mm/memory.c
-index 1d2ea39..bd7ea7f 100644
+index 833952d..08d4423 100644
 --- a/mm/memory.c
 +++ b/mm/memory.c
-@@ -2224,9 +2224,6 @@ reuse:
+@@ -2223,9 +2223,6 @@ reuse:
  gotten:
        pte_unmap_unlock(page_table, ptl);
  
@@ -12,7 +12,7 @@ index 1d2ea39..bd7ea7f 100644
        if (is_zero_pfn(pte_pfn(orig_pte))) {
                new_page = alloc_zeroed_user_highpage_movable(vma, address);
                if (!new_page)
-@@ -2767,8 +2764,6 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
+@@ -2766,8 +2763,6 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
        /* Allocate our own private page. */
        pte_unmap(page_table);
  
@@ -21,7 +21,7 @@ index 1d2ea39..bd7ea7f 100644
        page = alloc_zeroed_user_highpage_movable(vma, address);
        if (!page)
                goto oom;
-@@ -2864,10 +2859,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+@@ -2863,10 +2858,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
        if (flags & FAULT_FLAG_WRITE) {
                if (!(vma->vm_flags & VM_SHARED)) {
                        anon = 1;
@@ -32,7 +32,7 @@ index 1d2ea39..bd7ea7f 100644
                        page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
                                                vma, address);
                        if (!page) {
-@@ -3116,6 +3107,9 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+@@ -3115,6 +3106,9 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
        pmd_t *pmd;
        pte_t *pte;
  
@@ -43,7 +43,7 @@ index 1d2ea39..bd7ea7f 100644
  
        count_vm_event(PGFAULT);
 diff --git a/mm/mmap.c b/mm/mmap.c
-index bf0600c..4592a93 100644
+index 75557c6..82392c2 100644
 --- a/mm/mmap.c
 +++ b/mm/mmap.c
 @@ -463,6 +463,8 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,

> [ I have a horrible cold, and can hardly think straight. So who knows, 
>   maybe I'm missing something. But if you have lost one of the 
>   'anon_vma_prepare()' call sites, that would certainly explain why you 
>   get NULL anon_vma's ]

Oh, sorry to hear that. Ok, let's stop for today - it is 3am here and
even if some would say, "well, this is just getting interesting" :), I
think it would be best to "sleep on it." :)

Thanks.

--
commit 2156db98fd84d07e3b86564f429fcc8c6b7d61df
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Thu Apr 8 22:09:53 2010 +0200

    rmap: preallocate anon VMAs
    
    On Thu, 8 Apr 2010, Borislav Petkov wrote:
    >
    > There are still issues: vma_adjust() grabs mapping->i_mmap_lock for file
    > mappings while we might sleep in anon_vma_prepare():
    
    Ahh. Good catch. So I can't actually do that anon_vma_prepare() thing in
    __insert_vm_struct.
    
    It should be simple enough to just move it into the caller, just after it
    releases that lock. There's only one user of that __insert_vm_struct()
    anyway. You can do it yourself, or you can replace my previous patch with
    this..
    
    [ The patch below also makes it warn once and return SIGBUS for the case
      where there is no anon_vma.  I decided I still want to hear about it if
      there might be some path that tries to insert a vma on its own ]
    
    		Linus

diff --git a/mm/memory.c b/mm/memory.c
index 1d2ea39..bd7ea7f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2224,9 +2224,6 @@ reuse:
 gotten:
 	pte_unmap_unlock(page_table, ptl);
 
-	if (unlikely(anon_vma_prepare(vma)))
-		goto oom;
-
 	if (is_zero_pfn(pte_pfn(orig_pte))) {
 		new_page = alloc_zeroed_user_highpage_movable(vma, address);
 		if (!new_page)
@@ -2767,8 +2764,6 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	/* Allocate our own private page. */
 	pte_unmap(page_table);
 
-	if (unlikely(anon_vma_prepare(vma)))
-		goto oom;
 	page = alloc_zeroed_user_highpage_movable(vma, address);
 	if (!page)
 		goto oom;
@@ -2864,10 +2859,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (flags & FAULT_FLAG_WRITE) {
 		if (!(vma->vm_flags & VM_SHARED)) {
 			anon = 1;
-			if (unlikely(anon_vma_prepare(vma))) {
-				ret = VM_FAULT_OOM;
-				goto out;
-			}
 			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
 						vma, address);
 			if (!page) {
@@ -3116,6 +3107,9 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	pmd_t *pmd;
 	pte_t *pte;
 
+	if (WARN_ONCE(!vma->anon_vma, "Mapping with no anon_vma"))
+		return VM_FAULT_SIGBUS;
+
 	__set_current_state(TASK_RUNNING);
 
 	count_vm_event(PGFAULT);
diff --git a/mm/mmap.c b/mm/mmap.c
index bf0600c..4592a93 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -463,6 +463,8 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	mm->map_count++;
 	validate_mm(mm);
+
+	anon_vma_prepare(vma);
 }
 
 /*
@@ -628,6 +630,8 @@ again:			remove_next = 1 + (end > next->vm_end);
 	if (mapping)
 		spin_unlock(&mapping->i_mmap_lock);
 
+	anon_vma_prepare(vma);
+
 	if (remove_next) {
 		if (file) {
 			fput(file);
@@ -1674,12 +1678,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 	if (!(vma->vm_flags & VM_GROWSUP))
 		return -EFAULT;
 
-	/*
-	 * We must make sure the anon_vma is allocated
-	 * so that the anon_vma locking is not a noop.
-	 */
-	if (unlikely(anon_vma_prepare(vma)))
-		return -ENOMEM;
 	anon_vma_lock(vma);
 
 	/*
@@ -1720,13 +1718,6 @@ static int expand_downwards(struct vm_area_struct *vma,
 {
 	int error;
 
-	/*
-	 * We must make sure the anon_vma is allocated
-	 * so that the anon_vma locking is not a noop.
-	 */
-	if (unlikely(anon_vma_prepare(vma)))
-		return -ENOMEM;
-
 	address &= PAGE_MASK;
 	error = security_file_mmap(NULL, 0, 0, 0, address, 1);
 	if (error)

-- 
Regards/Gruss,
    Boris.


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-09  0:50                                                                         ` Linus Torvalds
  2010-04-09  1:30                                                                           ` Borislav Petkov
@ 2010-04-09  1:45                                                                           ` KOSAKI Motohiro
  1 sibling, 0 replies; 242+ messages in thread
From: KOSAKI Motohiro @ 2010-04-09  1:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kosaki.motohiro, Borislav Petkov, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

> 
> 
> On Fri, 9 Apr 2010, Borislav Petkov wrote:
> > 
> > Yep, looks good: it's mmap_region()...
> 
> Can you double-check your current diffs - maybe something got corrupted.
> 
> mmap_region installs the vma with vma_link(), and the last thing 
> vma_link() does with my patch is that "anon_vma_prepare()".

I agree. And at least your patch works fine on my box. I'll continue digging.



> 
> Maybe with all the patches flying around, you had a reject or something, 
> and you lost that one anon_vma_prepare()?
> 
> Or maybe I screwed up somewhere and sent you the wrong patch. Here it is 
> again, just in case.
> 
> [ I have a horrible cold, and can hardly think straight. So who knows, 
>   maybe I'm missing something. But if you have lost one of the 
>   'anon_vma_prepare()' call sites, that would certainly explain why you 
>   get NULL anon_vma's ]
> 
> 		Linus




* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-09  1:30                                                                           ` Borislav Petkov
@ 2010-04-09  9:21                                                                             ` Borislav Petkov
  2010-04-09 16:35                                                                               ` Linus Torvalds
  0 siblings, 1 reply; 242+ messages in thread
From: Borislav Petkov @ 2010-04-09  9:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

From: Borislav Petkov <bp@alien8.de>
Date: Fri, Apr 09, 2010 at 03:30:12AM +0200

> > Maybe with all the patches flying around, you had a reject or something, 
> > and you lost that one anon_vma_prepare()?
> > 
> > Or maybe I screwed up somewhere and sent you the wrong patch. Here it is 
> > again, just in case.
> 
> Doesn't look like it - here's the diff between yours and what I have
> applied here (yep, only minor fuzz but no code differences) Also, I've
> added my version at the end:

So I went and reapplied the three patches (3rd is the object_err export
for SLUB debugging) on a new branch of today's git - same results, the
same processes crap up in the WARN(!vma->anon_vma) check so it should be
something else we're missing.

More code staring later...

-- 
Regards/Gruss,
Boris.


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-09  9:21                                                                             ` Borislav Petkov
@ 2010-04-09 16:35                                                                               ` Linus Torvalds
  2010-04-09 17:40                                                                                 ` Borislav Petkov
  0 siblings, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-09 16:35 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson, hannes



On Fri, 9 Apr 2010, Borislav Petkov wrote:
> 
> So I went and reapplied the three patches (3rd is the object_err export
> for SLUB debugging) on a new branch of today's git - same results, the
> same processes crap up in the WARN(!vma->anon_vma) check so it should be
> something else we're missing.
> 
> More code staring later...

Can you try with _just_ my patch? Or add a 

	vma->anon_vma = merge_vma->anon_vma;

to Rik's "merge_vma" case in anon_vma_prepare().

Because I'm staring at Rik's patch, and one thing strikes me: it does that 
"anon_vma_clone()" in anon_vma_prepare(), and maybe I'm blind, but I don't 
see where that actually sets vma->anon_vma.

As far as I can tell, anon_vma_clone() was designed purely for the fork() 
case, which has done

	*new = *vma;

which will set new->anon_vma to the same anon_vma. But Rik's patch never 
does that for the anon_vma_prepare() case.

And maybe we should do it in anon_vma_clone() itself, just to make it 
impossible to mistakenly leave it out, the way I think Rik's patch did.
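
[ A toy userspace model of that difference, under loud assumptions: the
  structures and the toy_ functions below are hypothetical stand-ins, not
  the real rmap code. It only shows that copying the chain links does not
  set the vma's own anon_vma pointer, while a fork-style struct copy does. ]

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins for the kernel structures; not the real definitions. */
struct anon_vma {
	int id;
};

struct chain_link {
	struct anon_vma *anon_vma;
	struct chain_link *next;
};

struct vma {
	struct anon_vma *anon_vma;	/* the pointer the fault path checks */
	struct chain_link *chain;	/* every anon_vma this vma is linked to */
};

/* Models the cloning step alone: copies the links from src onto dst,
 * and deliberately never touches dst->anon_vma. */
int toy_clone_chain(struct vma *dst, const struct vma *src)
{
	const struct chain_link *l;

	for (l = src->chain; l; l = l->next) {
		struct chain_link *n = malloc(sizeof(*n));

		if (!n)
			return -1;
		n->anon_vma = l->anon_vma;
		n->next = dst->chain;
		dst->chain = n;
	}
	return 0;
}

/* fork()-style caller: the struct copy comes first, so the child
 * inherits the anon_vma pointer before the chain is cloned. */
int toy_fork_style(struct vma *child, const struct vma *parent)
{
	*child = *parent;
	child->chain = NULL;
	return toy_clone_chain(child, parent);
}

/* prepare-style caller: no struct copy, so the pointer is only set
 * when the caller asks for the explicit assignment. */
int toy_prepare_style(struct vma *vma, const struct vma *merge_vma,
		      int set_pointer)
{
	if (toy_clone_chain(vma, merge_vma))
		return -1;
	if (set_pointer)
		vma->anon_vma = merge_vma->anon_vma;
	return 0;
}

/* Release the links allocated by toy_clone_chain(). */
void toy_free_chain(struct vma *vma)
{
	while (vma->chain) {
		struct chain_link *next = vma->chain->next;

		free(vma->chain);
		vma->chain = next;
	}
}
```

[ With set_pointer == 0 the vma ends up with a populated chain but a NULL
  anon_vma, which is the state the WARN in handle_mm_fault() would catch. ]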

Anyway, I'm still groggy from all the flu medication, so take everything I 
say with a grain of salt.

In fact, the more I look at this, the less I think I like Rik's patch in 
the first place. I think the real bug that Rik tried to fix is that 
apparently anon_vma_merge() doesn't necessarily merge everything right. 
From Rik's bug-explanation, step 5:

>> 5) vma_adjust calls anon_vma_merge, causing the anon_vma
>>    chain of one of the VMAs to get nuked - with bad luck,
>>    this is the original one, leaving just the new anon_vma
>>   attached to the VMA

and I think that _this_ is the real bug to begin with. The real fix should 
be in vma_adjust/anon_vma_merge, not in how we set up the anon_vma in the 
first place. I do _not_ think we should require that we always merged 
things at mmap() time, because we may _never_ be able to merge perfectly 
(ie start out with two disjoint mmaps, and fill in the middle).
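
[ A rough sketch of the fill-in-the-middle case, with everything in it
  assumed for illustration: a toy vma is just a range plus a bitmask of
  the anon_vmas it is linked to, and toy_merge() is not the kernel's
  anon_vma_merge(). It models a merge that keeps the links of both sides
  instead of dropping one of them. ]

```c
#include <assert.h>

/* Toy vma: a half-open address range plus the set of toy anon_vmas it
 * is linked to (bit i set means "linked to anon_vma number i"). */
struct toy_vma {
	unsigned long start, end;	/* [start, end) */
	unsigned int anon_vma_mask;
};

/* Merge b into a if b begins exactly where a ends. The merged vma keeps
 * the union of both link sets, so pages that were faulted in through
 * either side can still be found afterwards. Returns 1 on merge, 0 if
 * the ranges are not adjacent. */
int toy_merge(struct toy_vma *a, const struct toy_vma *b)
{
	if (a->end != b->start)
		return 0;
	a->end = b->end;
	a->anon_vma_mask |= b->anon_vma_mask;
	return 1;
}
```

[ Two mappings that start out disjoint each get their own anon_vma; once
  the gap is mapped, all three ranges become adjacent and can merge, and
  the result has to cover both of the original anon_vmas. ]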

			Linus


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-09 16:35                                                                               ` Linus Torvalds
@ 2010-04-09 17:40                                                                                 ` Borislav Petkov
  2010-04-09 17:50                                                                                   ` Linus Torvalds
  0 siblings, 1 reply; 242+ messages in thread
From: Borislav Petkov @ 2010-04-09 17:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Fri, Apr 09, 2010 at 09:35:15AM -0700

> Can you try with _just_ my patch?

Yep, yours along with the SLUB debugging piece just survived one
hibernation cycle without a problem. Also, no SIGBUS-killed processes,
all seems fine. Will continue stressing it though...

Let me know what you want me to do next.

-- 
Regards/Gruss,
Boris.


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-09 17:40                                                                                 ` Borislav Petkov
@ 2010-04-09 17:50                                                                                   ` Linus Torvalds
  2010-04-09 19:14                                                                                     ` Borislav Petkov
  0 siblings, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-09 17:50 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson, hannes



On Fri, 9 Apr 2010, Borislav Petkov wrote:
>
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Fri, Apr 09, 2010 at 09:35:15AM -0700
> 
> > Can you try with _just_ my patch?
> 
> Yep, yours along with the SLUB debugging piece just survived one
> hibernation cycle without a problem. Also, no SIGBUS-killed processes,
> all seems fine. Will continue stressing it though...
> 
> Let me know what you want me to do next.

Continue stress-testing it. I don't think my patch on its own should fix 
the original problem, but at least we now know why you got those NULL 
anon_vma's.

So what I _think_ will happen is that you'll be able to re-create the 
problem that started this all.  But I'd like to verify that, just because 
I'm anal and I'd like these things to be tested independently.

So assuming that the original problem happens again, if you can then apply 
Rik's patch, but add a

	dst->anon_vma = src->anon_vma;

to just before the success case (the "return 0") in anon_vma_clone(), 
that would be good.

			Linus


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-09 17:50                                                                                   ` Linus Torvalds
@ 2010-04-09 19:14                                                                                     ` Borislav Petkov
  2010-04-09 19:32                                                                                       ` Linus Torvalds
  0 siblings, 1 reply; 242+ messages in thread
From: Borislav Petkov @ 2010-04-09 19:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Fri, Apr 09, 2010 at 10:50:23AM -0700

> Continue stress-testing it. I don't think my patch on its own should fix 
> the original problem, but at least we now know why you got those NULL 
> anon_vma's.
> 
> So what I _think_ will happen is that you'll be able to re-create the 
> problem that started this all.  But I'd like to verify that, just because 
> I'm anal and I'd like these things to be tested independently.

Heh, that was easy. Third hibernate cycle is a charm^Wboom :)

> So assuming that the original problem happens again, if you can then apply 
> Rik's patch, but add a
> 
> 	dst->anon_vma = src->anon_vma;
> 
> to just before the success case (the "return 0") in anon_vma_clone(), 
> that would be good.

It looks like this way we mangle the anon_vma chains somehow. From
what I can see and if I'm not mistaken, we save the anon_vmas alright
but end up in what seems like an endless list_for_each_entry()
loop having grabbed anon_vma->lock in page_lock_anon_vma() and we
can't seem to yield it through page_unlock_anon_vma() at the end of
page_referenced_anon() so it has to be that code in between iterating
over each list entry...

I could be completely wrong though...
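
To illustrate the hypothesis only (the traces below do not confirm it), here
is a small user-space sketch of why a circular-list walk in the style of
list_for_each_entry() never ends once the list is mangled so that it no
longer leads back to its head. The toy_node type and toy_walk() helper are
made up for this example, and the step limit exists only so the sketch can
report the problem instead of spinning:

```c
#include <assert.h>
#include <stddef.h>

/* Toy singly linked circular list node, NOT the kernel's struct list_head. */
struct toy_node {
	struct toy_node *next;
};

/*
 * Walk from head->next until we are back at head, the same termination
 * condition a circular-list iterator relies on. Returns the number of
 * entries visited, or -1 if more than 'limit' steps were taken without
 * ever returning to head.
 */
int toy_walk(const struct toy_node *head, int limit)
{
	const struct toy_node *pos;
	int n = 0;

	for (pos = head->next; pos != head; pos = pos->next) {
		if (++n > limit)
			return -1;
	}
	return n;
}
```

With head -> a -> b -> head the walk visits two entries and stops. If b's
next pointer is redirected to a, the walk cycles between a and b and only
the limit ends it; a real loop of that shape running with a spinlock held
would keep every other CPU waiting on that lock.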


[  373.683545] PM: Syncing filesystems ... done.
[  373.950289] Freezing user space processes ... (elapsed 0.04 seconds) done.
[  373.998878] Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.
[  374.011121] PM: Preallocating image memory... 
[  439.161126] BUG: soft lockup - CPU#1 stuck for 61s! [hib.sh:3617]
[  439.161315] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 serial_core edac_core pcspkr k10temp ohci_hcd
[  439.162302] irq event stamp: 0
[  439.162302] hardirqs last  enabled at (0): [<(null)>] (null)
[  439.162302] hardirqs last disabled at (0): [<ffffffff8103655c>] copy_process+0x3c1/0x10cc
[  439.163297] softirqs last  enabled at (0): [<ffffffff8103655c>] copy_process+0x3c1/0x10cc
[  439.163297] softirqs last disabled at (0): [<(null)>] (null)
[  439.163297] CPU 1 
[  439.163297] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 serial_core edac_core pcspkr k10temp ohci_hcd
[  439.165297] 
[  439.165297] Pid: 3617, comm: hib.sh Tainted: G        W  2.6.34-rc3-00413-g1028f7c-dirty #12 M3A78 PRO/System Product Name
[  439.165297] RIP: 0010:[<ffffffff8118b731>]  [<ffffffff8118b731>] delay_tsc+0x0/0xca
[  439.165297] RSP: 0018:ffff8801f68b77f0  EFLAGS: 00000202
[  439.166300] RAX: 0000000000000000 RBX: ffff8801f68b77f8 RCX: 000000000000f100
[  439.166300] RDX: 0000000000000001 RSI: ffff8801f68b7848 RDI: 0000000000000001
[  439.166300] RBP: ffffffff81002b4e R08: 0000000000000001 R09: 0000000000000000
[  439.166300] R10: ffff88022c9a3ac8 R11: ffffffff00000012 R12: 000000000000f100
[  439.166300] R13: 00000000cc444700 R14: 0000000000000001 R15: 0000000000000000
[  439.166300] FS:  00007f8d00e676f0(0000) GS:ffff88000a200000(0000) knlGS:0000000000000000
[  439.167296] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  439.167296] CR2: 00007fff93e8a9c0 CR3: 00000001f5397000 CR4: 00000000000006e0
[  439.167296] DR0: 00000000000000a0 DR1: 0000000000000000 DR2: 0000000000000003
[  439.167296] DR3: 00000000000000b0 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  439.167296] Process hib.sh (pid: 3617, threadinfo ffff8801f68b6000, task ffff88022a2e8000)
[  439.167296] Stack:
[  439.168297]  ffffffff8118b72f ffff8801f68b7848 ffffffff8119a1ca ffff880214972868
[  439.168297] <0> 0000000000000001 ffff880100000000 ffff880214972850 ffff880214972868
[  439.168297] <0> ffff8801f68b7cf8 ffff8801f68b7b78 ffff8801f68b7a00 ffff8801f68b7878
[  439.169298] Call Trace:
[  439.169298]  [<ffffffff8118b72f>] ? __delay+0xf/0x11
[  439.169298]  [<ffffffff8119a1ca>] ? do_raw_spin_lock+0xd2/0x13c
[  439.169298]  [<ffffffff813f827b>] ? _raw_spin_lock+0x60/0x73
[  439.170299]  [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac
[  439.170299]  [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac
[  439.170299]  [<ffffffff810c09d4>] ? page_lock_anon_vma+0x0/0xac
[  439.170299]  [<ffffffff810c0c1d>] ? page_referenced+0x80/0x1dc
[  439.170299]  [<ffffffff810c0b22>] ? try_to_unmap_anon+0xa2/0xb4
[  439.170299]  [<ffffffff810ab7a6>] ? shrink_page_list+0x14a/0x477
[  439.170299]  [<ffffffff813f8d86>] ? _raw_spin_unlock_irq+0x30/0x58
[  439.171296]  [<ffffffff810abe2a>] ? shrink_inactive_list+0x357/0x5e5
[  439.171296]  [<ffffffff810ab64a>] ? shrink_active_list+0x232/0x244
[  439.171296]  [<ffffffff810ac3c4>] ? shrink_zone+0x30c/0x3d6
[  439.171296]  [<ffffffff810acf9f>] ? do_try_to_free_pages+0x176/0x27f
[  439.171296]  [<ffffffff810ad13d>] ? shrink_all_memory+0x95/0xc4
[  439.171296]  [<ffffffff810aa640>] ? isolate_pages_global+0x0/0x1f0
[  439.171296]  [<ffffffff81076e60>] ? count_data_pages+0x65/0x79
[  439.172298]  [<ffffffff810770c7>] ? hibernate_preallocate_memory+0x1aa/0x2cb
[  439.172298]  [<ffffffff813f5285>] ? printk+0x41/0x44
[  439.172298]  [<ffffffff81075a67>] ? hibernation_snapshot+0x36/0x1e1
[  439.172298]  [<ffffffff81075ce0>] ? hibernate+0xce/0x172
[  439.172298]  [<ffffffff81074a4d>] ? state_store+0x5c/0xd3
[  439.172298]  [<ffffffff81184f8f>] ? kobj_attr_store+0x17/0x19
[  439.173296]  [<ffffffff81125dd7>] ? sysfs_write_file+0x108/0x144
[  439.173296]  [<ffffffff810d575f>] ? vfs_write+0xb2/0x153
[  439.173296]  [<ffffffff81063bed>] ? trace_hardirqs_on_caller+0x1f/0x14b
[  439.173296]  [<ffffffff810d58c3>] ? sys_write+0x4a/0x71
[  439.173296]  [<ffffffff810021db>] ? system_call_fastpath+0x16/0x1b
[  439.173296] Code: ff c8 c9 c3 55 48 89 e5 0f 1f 44 00 00 48 c7 05 12 35 4e 00 31 b7 18 81 c9 c3 55 48 89 e5 0f 1f 44 00 00 ff 15 01 35 4e 00 c9 c3 <55> 48 89 e5 41 57 41 56 41 55 41 54 53 48 83 ec 08 0f 1f 44 00 
[  439.176296] Call Trace:
[  439.177297]  [<ffffffff8118b72f>] ? __delay+0xf/0x11
[  439.177297]  [<ffffffff8119a1ca>] ? do_raw_spin_lock+0xd2/0x13c
[  439.177297]  [<ffffffff813f827b>] ? _raw_spin_lock+0x60/0x73
[  439.177297]  [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac
[  439.177297]  [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac
[  439.177297]  [<ffffffff810c09d4>] ? page_lock_anon_vma+0x0/0xac
[  439.177297]  [<ffffffff810c0c1d>] ? page_referenced+0x80/0x1dc
[  439.178295]  [<ffffffff810c0b22>] ? try_to_unmap_anon+0xa2/0xb4
[  439.178295]  [<ffffffff810ab7a6>] ? shrink_page_list+0x14a/0x477
[  439.178295]  [<ffffffff813f8d86>] ? _raw_spin_unlock_irq+0x30/0x58
[  439.178295]  [<ffffffff810abe2a>] ? shrink_inactive_list+0x357/0x5e5
[  439.178295]  [<ffffffff810ab64a>] ? shrink_active_list+0x232/0x244
[  439.178295]  [<ffffffff810ac3c4>] ? shrink_zone+0x30c/0x3d6
[  439.178295]  [<ffffffff810acf9f>] ? do_try_to_free_pages+0x176/0x27f
[  439.179299]  [<ffffffff810ad13d>] ? shrink_all_memory+0x95/0xc4
[  439.179299]  [<ffffffff810aa640>] ? isolate_pages_global+0x0/0x1f0
[  439.179299]  [<ffffffff81076e60>] ? count_data_pages+0x65/0x79
[  439.179299]  [<ffffffff810770c7>] ? hibernate_preallocate_memory+0x1aa/0x2cb
[  439.179299]  [<ffffffff813f5285>] ? printk+0x41/0x44
[  439.179299]  [<ffffffff81075a67>] ? hibernation_snapshot+0x36/0x1e1
[  439.180296]  [<ffffffff81075ce0>] ? hibernate+0xce/0x172
[  439.180296]  [<ffffffff81074a4d>] ? state_store+0x5c/0xd3
[  439.180296]  [<ffffffff81184f8f>] ? kobj_attr_store+0x17/0x19
[  439.180296]  [<ffffffff81125dd7>] ? sysfs_write_file+0x108/0x144
[  439.180296]  [<ffffffff810d575f>] ? vfs_write+0xb2/0x153
[  439.180296]  [<ffffffff81063bed>] ? trace_hardirqs_on_caller+0x1f/0x14b
[  439.180296]  [<ffffffff810d58c3>] ? sys_write+0x4a/0x71
[  439.181297]  [<ffffffff810021db>] ? system_call_fastpath+0x16/0x1b
[  504.659125] BUG: soft lockup - CPU#1 stuck for 61s! [hib.sh:3617]
[  504.659126] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 serial_core edac_core pcspkr k10temp ohci_hcd
[  504.660297] irq event stamp: 0
[  504.660297] hardirqs last  enabled at (0): [<(null)>] (null)
[  504.660297] hardirqs last disabled at (0): [<ffffffff8103655c>] copy_process+0x3c1/0x10cc
[  504.661298] softirqs last  enabled at (0): [<ffffffff8103655c>] copy_process+0x3c1/0x10cc
[  504.661298] softirqs last disabled at (0): [<(null)>] (null)
[  504.661298] CPU 1 
[  504.661298] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 serial_core edac_core pcspkr k10temp ohci_hcd
[  504.663297] 
[  504.663297] Pid: 3617, comm: hib.sh Tainted: G        W  2.6.34-rc3-00413-g1028f7c-dirty #12 M3A78 PRO/System Product Name
[  504.663297] RIP: 0010:[<ffffffff8118b775>]  [<ffffffff8118b775>] delay_tsc+0x44/0xca
[  504.663297] RSP: 0018:ffff8801f68b77b8  EFLAGS: 00000206
[  504.663297] RAX: 00000000a4911fed RBX: ffff8801f68b77e8 RCX: 000000000000f100
[  504.664326] RDX: 00000000000000f1 RSI: ffff8801f68b7848 RDI: 0000000000000001
[  504.664326] RBP: ffffffff81002b4e R08: 0000000000000001 R09: 0000000000000000
[  504.664326] R10: ffff88022c9a3ac8 R11: ffffffff00000012 R12: 0000000000000010
[  504.664326] R13: ffff88000a200000 R14: ffff8801f68b6000 R15: ffff8801f68b7fd8
[  504.664326] FS:  00007f8d00e676f0(0000) GS:ffff88000a200000(0000) knlGS:0000000000000000
[  504.664326] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  504.665296] CR2: 00007fff93e8a9c0 CR3: 00000001f5397000 CR4: 00000000000006e0
[  504.665296] DR0: 00000000000000a0 DR1: 0000000000000000 DR2: 0000000000000003
[  504.665296] DR3: 00000000000000b0 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  504.665296] Process hib.sh (pid: 3617, threadinfo ffff8801f68b6000, task ffff88022a2e8000)
[  504.665296] Stack:
[  504.665296]  0000000000000001 ffff880214972850 ffff88022a2e8000 00000000b3450160
[  504.666297] <0> ffff88022a2e83a8 000000005486e668 ffff8801f68b77f8 ffffffff8118b72f
[  504.666297] <0> ffff8801f68b7848 ffffffff8119a1ca ffff880214972868 0000000000000001
[  504.667298] Call Trace:
[  504.667298]  [<ffffffff8118b72f>] ? __delay+0xf/0x11
[  504.667298]  [<ffffffff8119a1ca>] ? do_raw_spin_lock+0xd2/0x13c
[  504.667298]  [<ffffffff813f827b>] ? _raw_spin_lock+0x60/0x73
[  504.667298]  [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac
[  504.668288]  [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac
[  504.668298]  [<ffffffff810c09d4>] ? page_lock_anon_vma+0x0/0xac
[  504.668298]  [<ffffffff810c0c1d>] ? page_referenced+0x80/0x1dc
[  504.668298]  [<ffffffff810c0b22>] ? try_to_unmap_anon+0xa2/0xb4
[  504.668298]  [<ffffffff810ab7a6>] ? shrink_page_list+0x14a/0x477
[  504.668298]  [<ffffffff813f8d86>] ? _raw_spin_unlock_irq+0x30/0x58
[  504.668298]  [<ffffffff810abe2a>] ? shrink_inactive_list+0x357/0x5e5
[  504.669296]  [<ffffffff810ab64a>] ? shrink_active_list+0x232/0x244
[  504.669296]  [<ffffffff810ac3c4>] ? shrink_zone+0x30c/0x3d6
[  504.669296]  [<ffffffff810acf9f>] ? do_try_to_free_pages+0x176/0x27f
[  504.669296]  [<ffffffff810ad13d>] ? shrink_all_memory+0x95/0xc4
[  504.669296]  [<ffffffff810aa640>] ? isolate_pages_global+0x0/0x1f0
[  504.669296]  [<ffffffff81076e60>] ? count_data_pages+0x65/0x79
[  504.669296]  [<ffffffff810770c7>] ? hibernate_preallocate_memory+0x1aa/0x2cb
[  504.670302]  [<ffffffff813f5285>] ? printk+0x41/0x44
[  504.670302]  [<ffffffff81075a67>] ? hibernation_snapshot+0x36/0x1e1
[  504.670302]  [<ffffffff81075ce0>] ? hibernate+0xce/0x172
[  504.670302]  [<ffffffff81074a4d>] ? state_store+0x5c/0xd3
[  504.670302]  [<ffffffff81184f8f>] ? kobj_attr_store+0x17/0x19
[  504.670302]  [<ffffffff81125dd7>] ? sysfs_write_file+0x108/0x144
[  504.670302]  [<ffffffff810d575f>] ? vfs_write+0xb2/0x153
[  504.671297]  [<ffffffff81063bed>] ? trace_hardirqs_on_caller+0x1f/0x14b
[  504.671297]  [<ffffffff810d58c3>] ? sys_write+0x4a/0x71
[  504.673315]  [<ffffffff810021db>] ? system_call_fastpath+0x16/0x1b
[  504.674350] Code: bf 01 00 00 00 e8 f8 1d ea ff e8 9f f4 00 00 41 89 c5 0f ae f0 66 66 90 0f 31 89 c3 65 4c 8b 34 25 48 b5 00 00 0f ae f0 66 66 90 <0f> 31 41 89 c7 4c 89 f8 48 29 d8 4c 39 e0 73 49 bf 01 00 00 00 
[  504.677299] Call Trace:
[  504.677299]  [<ffffffff8118b72f>] ? __delay+0xf/0x11
[  504.677299]  [<ffffffff8119a1ca>] ? do_raw_spin_lock+0xd2/0x13c
[  504.677299]  [<ffffffff813f827b>] ? _raw_spin_lock+0x60/0x73
[  504.677299]  [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac
[  504.678287]  [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac
[  504.678296]  [<ffffffff810c09d4>] ? page_lock_anon_vma+0x0/0xac
[  504.678296]  [<ffffffff810c0c1d>] ? page_referenced+0x80/0x1dc
[  504.678296]  [<ffffffff810c0b22>] ? try_to_unmap_anon+0xa2/0xb4
[  504.678296]  [<ffffffff810ab7a6>] ? shrink_page_list+0x14a/0x477
[  504.678296]  [<ffffffff813f8d86>] ? _raw_spin_unlock_irq+0x30/0x58
[  504.678296]  [<ffffffff810abe2a>] ? shrink_inactive_list+0x357/0x5e5
[  504.679297]  [<ffffffff810ab64a>] ? shrink_active_list+0x232/0x244
[  504.679297]  [<ffffffff810ac3c4>] ? shrink_zone+0x30c/0x3d6
[  504.679297]  [<ffffffff810acf9f>] ? do_try_to_free_pages+0x176/0x27f
[  504.679297]  [<ffffffff810ad13d>] ? shrink_all_memory+0x95/0xc4
[  504.679297]  [<ffffffff810aa640>] ? isolate_pages_global+0x0/0x1f0
[  504.679297]  [<ffffffff81076e60>] ? count_data_pages+0x65/0x79
[  504.679297]  [<ffffffff810770c7>] ? hibernate_preallocate_memory+0x1aa/0x2cb
[  504.680303]  [<ffffffff813f5285>] ? printk+0x41/0x44
[  504.680303]  [<ffffffff81075a67>] ? hibernation_snapshot+0x36/0x1e1
[  504.680303]  [<ffffffff81075ce0>] ? hibernate+0xce/0x172
[  504.680303]  [<ffffffff81074a4d>] ? state_store+0x5c/0xd3
[  504.680303]  [<ffffffff81184f8f>] ? kobj_attr_store+0x17/0x19
[  504.680303]  [<ffffffff81125dd7>] ? sysfs_write_file+0x108/0x144
[  504.680303]  [<ffffffff810d575f>] ? vfs_write+0xb2/0x153
[  504.681297]  [<ffffffff81063bed>] ? trace_hardirqs_on_caller+0x1f/0x14b
[  504.681297]  [<ffffffff810d58c3>] ? sys_write+0x4a/0x71
[  504.681297]  [<ffffffff810021db>] ? system_call_fastpath+0x16/0x1b
[  570.157125] BUG: soft lockup - CPU#1 stuck for 61s! [hib.sh:3617]
[  570.157126] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 serial_core edac_core pcspkr k10temp ohci_hcd
[  570.158283] irq event stamp: 0
[  570.158283] hardirqs last  enabled at (0): [<(null)>] (null)
[  570.158283] hardirqs last disabled at (0): [<ffffffff8103655c>] copy_process+0x3c1/0x10cc
[  570.159297] softirqs last  enabled at (0): [<ffffffff8103655c>] copy_process+0x3c1/0x10cc
[  570.159297] softirqs last disabled at (0): [<(null)>] (null)
[  570.159297] CPU 1 
[  570.159297] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 serial_core edac_core pcspkr k10temp ohci_hcd
[  570.161297] 
[  570.161297] Pid: 3617, comm: hib.sh Tainted: G        W  2.6.34-rc3-00413-g1028f7c-dirty #12 M3A78 PRO/System Product Name
[  570.161297] RIP: 0010:[<ffffffff8118b777>]  [<ffffffff8118b777>] delay_tsc+0x46/0xca
[  570.161297] RSP: 0018:ffff8801f68b77b8  EFLAGS: 00000206
[  570.161297] RAX: 000000007cdde43c RBX: ffff8801f68b77e8 RCX: 000000000000f100
[  570.162296] RDX: 000000000000011f RSI: ffff8801f68b7848 RDI: 0000000000000001
[  570.162296] RBP: ffffffff81002b4e R08: 0000000000000001 R09: 0000000000000000
[  570.162296] R10: ffff88022c9a3ac8 R11: ffffffff00000012 R12: 0000000000000010
[  570.162296] R13: ffff88000a200000 R14: ffff8801f68b6000 R15: ffff8801f68b7fd8
[  570.162296] FS:  00007f8d00e676f0(0000) GS:ffff88000a200000(0000) knlGS:0000000000000000
[  570.162296] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  570.163296] CR2: 00007fff93e8a9c0 CR3: 00000001f5397000 CR4: 00000000000006e0
[  570.163296] DR0: 00000000000000a0 DR1: 0000000000000000 DR2: 0000000000000003
[  570.163296] DR3: 00000000000000b0 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  570.163296] Process hib.sh (pid: 3617, threadinfo ffff8801f68b6000, task ffff88022a2e8000)
[  570.163296] Stack:
[  570.163296]  0000000000000001 ffff880214972850 ffff88022a2e8000 00000000b3450160
[  570.164335] <0> ffff88022a2e83a8 000000007f0025c7 ffff8801f68b77f8 ffffffff8118b72f
[  570.164335] <0> ffff8801f68b7848 ffffffff8119a1ca ffff880214972868 0000000000000001
[  570.165299] Call Trace:
[  570.165299]  [<ffffffff8118b72f>] ? __delay+0xf/0x11
[  570.165299]  [<ffffffff8119a1ca>] ? do_raw_spin_lock+0xd2/0x13c
[  570.165299]  [<ffffffff813f827b>] ? _raw_spin_lock+0x60/0x73
[  570.165299]  [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac
[  570.165299]  [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac
[  570.166297]  [<ffffffff810c09d4>] ? page_lock_anon_vma+0x0/0xac
[  570.166297]  [<ffffffff810c0c1d>] ? page_referenced+0x80/0x1dc
[  570.166297]  [<ffffffff810c0b22>] ? try_to_unmap_anon+0xa2/0xb4
[  570.166297]  [<ffffffff810ab7a6>] ? shrink_page_list+0x14a/0x477
[  570.166297]  [<ffffffff813f8d86>] ? _raw_spin_unlock_irq+0x30/0x58
[  570.166297]  [<ffffffff810abe2a>] ? shrink_inactive_list+0x357/0x5e5
[  570.167296]  [<ffffffff810ab64a>] ? shrink_active_list+0x232/0x244
[  570.167296]  [<ffffffff810ac3c4>] ? shrink_zone+0x30c/0x3d6
[  570.167296]  [<ffffffff810acf9f>] ? do_try_to_free_pages+0x176/0x27f
[  570.167296]  [<ffffffff810ad13d>] ? shrink_all_memory+0x95/0xc4
[  570.167296]  [<ffffffff810aa640>] ? isolate_pages_global+0x0/0x1f0
[  570.167296]  [<ffffffff81076e60>] ? count_data_pages+0x65/0x79
[  570.167296]  [<ffffffff810770c7>] ? hibernate_preallocate_memory+0x1aa/0x2cb
[  570.168286]  [<ffffffff813f5285>] ? printk+0x41/0x44
[  570.168286]  [<ffffffff81075a67>] ? hibernation_snapshot+0x36/0x1e1
[  570.168286]  [<ffffffff81075ce0>] ? hibernate+0xce/0x172
[  570.168286]  [<ffffffff81074a4d>] ? state_store+0x5c/0xd3
[  570.168286]  [<ffffffff81184f8f>] ? kobj_attr_store+0x17/0x19
[  570.168286]  [<ffffffff81125dd7>] ? sysfs_write_file+0x108/0x144
[  570.168286]  [<ffffffff810d575f>] ? vfs_write+0xb2/0x153
[  570.169297]  [<ffffffff81063bed>] ? trace_hardirqs_on_caller+0x1f/0x14b
[  570.169297]  [<ffffffff810d58c3>] ? sys_write+0x4a/0x71
[  570.169297]  [<ffffffff810021db>] ? system_call_fastpath+0x16/0x1b
[  570.169297] Code: 00 00 00 e8 f8 1d ea ff e8 9f f4 00 00 41 89 c5 0f ae f0 66 66 90 0f 31 89 c3 65 4c 8b 34 25 48 b5 00 00 0f ae f0 66 66 90 0f 31 <41> 89 c7 4c 89 f8 48 29 d8 4c 39 e0 73 49 bf 01 00 00 00 e8 07 
[  570.172299] Call Trace:
[  570.172299]  [<ffffffff8118b72f>] ? __delay+0xf/0x11
[  570.172299]  [<ffffffff8119a1ca>] ? do_raw_spin_lock+0xd2/0x13c
[  570.173297]  [<ffffffff813f827b>] ? _raw_spin_lock+0x60/0x73
[  570.173297]  [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac
[  570.173297]  [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac
[  570.173297]  [<ffffffff810c09d4>] ? page_lock_anon_vma+0x0/0xac
[  570.173297]  [<ffffffff810c0c1d>] ? page_referenced+0x80/0x1dc
[  570.173297]  [<ffffffff810c0b22>] ? try_to_unmap_anon+0xa2/0xb4
[  570.174329]  [<ffffffff810ab7a6>] ? shrink_page_list+0x14a/0x477
[  570.174329]  [<ffffffff813f8d86>] ? _raw_spin_unlock_irq+0x30/0x58
[  570.174329]  [<ffffffff810abe2a>] ? shrink_inactive_list+0x357/0x5e5
[  570.174329]  [<ffffffff810ab64a>] ? shrink_active_list+0x232/0x244
[  570.174329]  [<ffffffff810ac3c4>] ? shrink_zone+0x30c/0x3d6
[  570.174329]  [<ffffffff810acf9f>] ? do_try_to_free_pages+0x176/0x27f
[  570.174329]  [<ffffffff810ad13d>] ? shrink_all_memory+0x95/0xc4
[  570.175297]  [<ffffffff810aa640>] ? isolate_pages_global+0x0/0x1f0
[  570.175297]  [<ffffffff81076e60>] ? count_data_pages+0x65/0x79
[  570.175297]  [<ffffffff810770c7>] ? hibernate_preallocate_memory+0x1aa/0x2cb
[  570.175297]  [<ffffffff813f5285>] ? printk+0x41/0x44
[  570.175297]  [<ffffffff81075a67>] ? hibernation_snapshot+0x36/0x1e1
[  570.175297]  [<ffffffff81075ce0>] ? hibernate+0xce/0x172
[  570.175297]  [<ffffffff81074a4d>] ? state_store+0x5c/0xd3
[  570.176298]  [<ffffffff81184f8f>] ? kobj_attr_store+0x17/0x19
[  570.176298]  [<ffffffff81125dd7>] ? sysfs_write_file+0x108/0x144
[  570.176298]  [<ffffffff810d575f>] ? vfs_write+0xb2/0x153
[  570.176298]  [<ffffffff81063bed>] ? trace_hardirqs_on_caller+0x1f/0x14b
[  570.176298]  [<ffffffff810d58c3>] ? sys_write+0x4a/0x71
[  570.176298]  [<ffffffff810021db>] ? system_call_fastpath+0x16/0x1b


-- 
Regards/Gruss,
Boris.


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-09 19:14                                                                                     ` Borislav Petkov
@ 2010-04-09 19:32                                                                                       ` Linus Torvalds
  2010-04-09 20:03                                                                                         ` Rik van Riel
  2010-04-09 20:43                                                                                         ` Johannes Weiner
  0 siblings, 2 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-09 19:32 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson, hannes



On Fri, 9 Apr 2010, Borislav Petkov wrote:
> > 
> > So what I _think_ will happen is that you'll be able to re-create the 
> > problem that started this all.  But I'd like to verify that, just because 
> > I'm anal and I'd like these things to be tested independently.
> 
> Heh, that was easy. Third hibernate cycle is a charm^Wboom :)

Ok, good to know that I'm still tracking ok on the issue.

> > So assuming that the original problem happens again, if you can then apply 
> > Rik's patch, but add a
> > 
> > 	dst->anon_vma = src->anon_vma;
> > 
> > to just before the success case (the "return 0") in anon_vma_clone(), 
> > that would be good.
> 
> It looks like this way we mangle the anon_vma chains somehow. From
> what I can see and if I'm not mistaken, we save the anon_vmas alright
> but end up in what seems like an endless list_for_each_entry()
> loop having grabbed anon_vma->lock in page_lock_anon_vma() and we
> can't seem to yield it through page_unlock_anon_vma() at the end of
> page_referenced_anon() so it has to be that code in between iterating
> over each list entry...

Ok. So scratch Rik's patch. It doesn't work even with the anon_vma set up.

Rik? I think it's back to you. I'm not going to bother committing the 
change to the anon_vma locking unless you actually need the locking 
guarantees for anon_vma_prepare().

And I've got the feeling that the proper fix is in the vma_adjust() 
handling if your original idea was right.

Anybody?

We're at the point where I've already delayed -rc4 several days because 
it's pointless cutting it without fixing this. One option is to just say 
"f*ck it, we'll revert it all and try again later". But it feels so 
close..

		Linus


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-09 19:32                                                                                       ` Linus Torvalds
@ 2010-04-09 20:03                                                                                         ` Rik van Riel
  2010-04-09 20:43                                                                                         ` Johannes Weiner
  1 sibling, 0 replies; 242+ messages in thread
From: Rik van Riel @ 2010-04-09 20:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, KOSAKI Motohiro, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

On 04/09/2010 03:32 PM, Linus Torvalds wrote:

> Rik? I think it's back to you. I'm not going to bother committing the
> change to the anon_vma locking unless you actually need the locking
> guarantees for anon_vma_prepare().

> And I've got the feeling that the proper fix is in the vma_adjust()
> handling if your original idea was right.

We can fix it on the other side, by changing anon_vma_merge
to actually link all the anon_vma structs into the VMA.

An added benefit is that we are already holding the required
lock (mmap_sem) exclusively in that code path.

I'll cook up a patch and I'll mail it out after a little
testing.


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-09 19:32                                                                                       ` Linus Torvalds
  2010-04-09 20:03                                                                                         ` Rik van Riel
@ 2010-04-09 20:43                                                                                         ` Johannes Weiner
  2010-04-09 20:57                                                                                           ` Rik van Riel
                                                                                                             ` (2 more replies)
  1 sibling, 3 replies; 242+ messages in thread
From: Johannes Weiner @ 2010-04-09 20:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On Fri, Apr 09, 2010 at 12:32:30PM -0700, Linus Torvalds wrote:
> 
> 
> On Fri, 9 Apr 2010, Borislav Petkov wrote:
> > > 
> > > So what I _think_ will happen is that you'll be able to re-create the 
> > > problem that started this all.  But I'd like to verify that, just because 
> > > I'm anal and I'd like these things to be tested independently.
> > 
> > Heh, that was easy. Third hibernate cycle is a charm^Wboom :)
> 
> Ok, good to know that I'm still tracking ok on the issue.
> 
> > > So assuming that the original problem happens again, if you can then apply 
> > > Rik's patch, but add a
> > > 
> > > 	dst->anon_vma = src->anon_vma;
> > > 
> > > to just before the success case (the "return 0") in anon_vma_clone(), 
> > > that would be good.
> > 
> > It looks like this way we mangle the anon_vma chains somehow. From
> > what I can see and if I'm not mistaken, we save the anon_vmas alright
> > but end up in what seems like an endless list_for_each_entry()
> > loop having grabbed anon_vma->lock in page_lock_anon_vma() and we
> > can't seem to yield it through page_unlock_anon_vma() at the end of
> > page_referenced_anon() so it has to be that code in between iterating
> > over each list entry...
> 
> Ok. So scratch Rik's patch. It doesn't work even with the anon_vma set up.
> 
> Rik? I think it's back to you. I'm not going to bother committing the 
> change to the anon_vma locking unless you actually need the locking 
> guarantees for anon_vma_prepare().
> 
> And I've got the feeling that the proper fix is in the vma_adjust() 
> handling if your original idea was right.
> 
> Anybody?

Okay, I think I got it working.  I first thought we would need an
m^n loop to properly merge the anon_vma_chains, but we can actually
be cleverer than that:

---
Subject: mm: properly merge anon_vma_chains when merging vmas

Merging can happen when two VMAs were split from one root VMA or
a mergeable VMA was instantiated and reused a nearby VMA's anon_vma.

In both cases, none of the VMAs can grow any more anon_vmas and forked
VMAs can no longer get merged due to differing primary anon_vmas for
their private COW-broken pages.

In the split case, the anon_vma_chains are equal and we can just drop
the chain of the VMA that is going away.

In the other case, the VMA that was instantiated later has only one
anon_vma on its chain: the primary anon_vma of its merge partner (due
to anon_vma_prepare()).

If the VMA that came later is going away, its anon_vma_chain is a
subset of the one that is staying, so it can be dropped like in the
split case.

Only if the VMA that came first is going away, its potential parent
anon_vmas need to be migrated to the VMA that is staying.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---

It compiles and boots but I have not really exercised this code.
Boris, could you give it a spin?  Thanks!

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d25bd22..ecef882 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -114,13 +114,7 @@ int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
 int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
 void __anon_vma_link(struct vm_area_struct *);
 void anon_vma_free(struct anon_vma *);
-
-static inline void anon_vma_merge(struct vm_area_struct *vma,
-				  struct vm_area_struct *next)
-{
-	VM_BUG_ON(vma->anon_vma != next->anon_vma);
-	unlink_anon_vmas(next);
-}
+void anon_vma_merge(struct vm_area_struct *, struct vm_area_struct *);
 
 /*
  * rmap interfaces called when adding or removing pte of page
diff --git a/mm/rmap.c b/mm/rmap.c
index eaa7a09..498a46e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -268,6 +268,58 @@ void unlink_anon_vmas(struct vm_area_struct *vma)
 	}
 }
 
+void anon_vma_merge(struct vm_area_struct *vma, struct vm_area_struct *next)
+{
+	VM_BUG_ON(vma->anon_vma != next->anon_vma);
+	/*
+	 * 1. case: vma and next are split parts of one root vma.
+	 * Their anon_vma_chain is equal and we can drop that of next.
+	 *
+	 * 2. case: one vma was instantiated as mergeable with the
+	 * other one and inherited the other one's primary anon_vma as
+	 * the singleton in its chain.
+	 *
+	 * If next came after vma, vma's chain is already an unstrict
+	 * superset of next's and we can treat it like case 1.
+	 *
+	 * If vma has the singleton chain, we have to copy next's
+	 * unique anon_vmas over.
+	 */
+	if (!list_is_singular(&vma->anon_vma_chain)) {
+		unlink_anon_vmas(next);
+		return;
+	}
+	while (!list_empty(&next->anon_vma_chain)) {
+		struct anon_vma_chain *avc;
+
+		avc = list_first_entry(&next->anon_vma_chain,
+				struct anon_vma_chain, same_vma);
+		if (avc->anon_vma == vma->anon_vma) {
+			/*
+			 * The shared one that vma inherited in
+			 * anon_vma_prepare.  Don't copy it, we
+			 * already have it.
+			 */
+			spin_lock(&avc->anon_vma->lock);
+			list_del(&avc->same_anon_vma);
+			spin_unlock(&avc->anon_vma->lock);
+
+			list_del(&avc->same_vma);
+			anon_vma_chain_free(avc);
+		} else {
+			/*
+			 * One of the parent anon_vmas, move it over.
+			 * Make sure nobody walks the vma list while
+			 * the entries are in flux.
+			 */
+			spin_lock(&avc->anon_vma->lock);
+			avc->vma = vma;
+			list_move_tail(&avc->same_vma, &vma->anon_vma_chain);
+			spin_unlock(&avc->anon_vma->lock);
+		}
+	}
+}
+
 static void anon_vma_ctor(void *data)
 {
 	struct anon_vma *anon_vma = data;



* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-09 20:43                                                                                         ` Johannes Weiner
@ 2010-04-09 20:57                                                                                           ` Rik van Riel
  2010-04-09 21:33                                                                                           ` Borislav Petkov
  2010-04-09 23:22                                                                                           ` Linus Torvalds
  2 siblings, 0 replies; 242+ messages in thread
From: Rik van Riel @ 2010-04-09 20:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linus Torvalds, Borislav Petkov, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On 04/09/2010 04:43 PM, Johannes Weiner wrote:

> Okay, I think I got it working.  I first thought we would need an
> m^n loop to properly merge the anon_vma_chains, but we can actually
> be cleverer than that:

I've looked it over 5 times, can't find anything wrong
with it.  Your approach looks like it should work just
fine.

Certainly easier than the things Linus and I tried :)

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Rik van Riel <riel@redhat.com>


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-09 20:43                                                                                         ` Johannes Weiner
  2010-04-09 20:57                                                                                           ` Rik van Riel
@ 2010-04-09 21:33                                                                                           ` Borislav Petkov
  2010-04-09 23:22                                                                                           ` Linus Torvalds
  2 siblings, 0 replies; 242+ messages in thread
From: Borislav Petkov @ 2010-04-09 21:33 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linus Torvalds, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

From: Johannes Weiner <hannes@cmpxchg.org>
Date: Fri, Apr 09, 2010 at 10:43:28PM +0200

Hi Hannes :) ,

> ---
> Subject: mm: properly merge anon_vma_chains when merging vmas
> 
> Merging can happen when two VMAs were split from one root VMA or
> a mergeable VMA was instantiated and reused a nearby VMA's anon_vma.
> 
> In both cases, none of the VMAs can grow any more anon_vmas and forked
> VMAs can no longer get merged due to differing primary anon_vmas for
> their private COW-broken pages.
> 
> In the split case, the anon_vma_chains are equal and we can just drop
> the one of the VMA that is going away.
> 
> In the other case, the VMA that was instantiated later has only one
> anon_vma on its chain: the primary anon_vma of its merge partner (due
> to anon_vma_prepare()).
> 
> If the VMA that came later is going away, its anon_vma_chain is a
> subset of the one that is staying, so it can be dropped like in the
> split case.
> 
> Only if the VMA that came first is going away, its potential parent
> anon_vmas need to be migrated to the VMA that is staying.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
> 
> It compiles and boots but I have not really excercised this code.
> Boris, could you give it a spin?  Thanks!

OK, I applied this on top of mainline (no other patches from this
thread), but unfortunately it breaks at the same spot under heavy page
reclaim, when trying to hibernate while booting 3 guests.

[  322.171120] PM: Preallocating image memory... 
[  322.477374] BUG: unable to handle kernel NULL pointer dereference at (null)
[  322.477376] IP: [<ffffffff810c0c87>] page_referenced+0xee/0x1dc
[  322.477376] PGD 2014e8067 PUD 221b4e067 PMD 0 
[  322.477376] Oops: 0000 [#1] PREEMPT SMP 
[  322.477376] last sysfs file: /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq
[  322.477376] CPU 3 
[  322.477376] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 pcspkr serial_core k10temp ohci_hcd edac_core
[  322.477376] 
[  322.477376] Pid: 2750, comm: hib.sh Tainted: G        W  2.6.34-rc3-00411-ga7247b6 #13 M3A78 PRO/System Product Name
[  322.477376] RIP: 0010:[<ffffffff810c0c87>]  [<ffffffff810c0c87>] page_referenced+0xee/0x1dc
[  322.477376] RSP: 0018:ffff88020936d8b8  EFLAGS: 00010283
[  322.477376] RAX: ffff88022de91af0 RBX: ffffea0006dcb488 RCX: 0000000000000000
[  322.477376] RDX: ffff88020936dcf8 RSI: ffff88022de91ac8 RDI: ffff88022ced0000
[  322.477376] RBP: ffff88020936d938 R08: 0000000000000002 R09: 0000000000000000
[  322.477376] R10: 0000000000000246 R11: 0000000000000003 R12: 0000000000000000
[  322.477376] R13: ffffffffffffffe0 R14: ffff88022de91ab0 R15: ffff88020936da00
[  322.477376] FS:  00007f286493e6f0(0000) GS:ffff88000a600000(0000) knlGS:0000000000000000
[  322.477376] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  322.477376] CR2: 0000000000000000 CR3: 00000001f8354000 CR4: 00000000000006e0
[  322.477376] DR0: 0000000000000090 DR1: 00000000000000a4 DR2: 00000000000000ff
[  322.477376] DR3: 000000000000000f DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  322.477376] Process hib.sh (pid: 2750, threadinfo ffff88020936c000, task ffff88022ced0000)
[  322.477376] Stack:
[  322.477376]  ffff88022de91af0 00000000813f8eec ffffffff8165ce28 000000000000002e
[  322.477376] <0> ffff88020936d8f8 ffffffff810c60bc ffffea0006dcb450 ffffea0006dcb450
[  322.477376] <0> ffff88020936d938 00000002810ab29d 0000000006f316b0 ffffea0006dcb4b0
[  322.477376] Call Trace:
[  322.477376]  [<ffffffff810c60bc>] ? swapcache_free+0x37/0x3c
[  322.477376]  [<ffffffff810ab7c2>] shrink_page_list+0x14a/0x477
[  322.477376]  [<ffffffff810abe46>] shrink_inactive_list+0x357/0x5e5
[  322.477376]  [<ffffffff810ab666>] ? shrink_active_list+0x232/0x244
[  322.477376]  [<ffffffff810ac3e0>] shrink_zone+0x30c/0x3d6
[  322.477376]  [<ffffffff810acfbb>] do_try_to_free_pages+0x176/0x27f
[  322.477376]  [<ffffffff810ad159>] shrink_all_memory+0x95/0xc4
[  322.477376]  [<ffffffff810aa65c>] ? isolate_pages_global+0x0/0x1f0
[  322.477376]  [<ffffffff81076e7c>] ? count_data_pages+0x65/0x79
[  322.477376]  [<ffffffff810770e3>] hibernate_preallocate_memory+0x1aa/0x2cb
[  322.477376]  [<ffffffff813f5325>] ? printk+0x41/0x44
[  322.477376]  [<ffffffff81075a83>] hibernation_snapshot+0x36/0x1e1
[  322.477376]  [<ffffffff81075cfc>] hibernate+0xce/0x172
[  322.477376]  [<ffffffff81074a69>] state_store+0x5c/0xd3
[  322.477376]  [<ffffffff81185043>] kobj_attr_store+0x17/0x19
[  322.477376]  [<ffffffff81125e87>] sysfs_write_file+0x108/0x144
[  322.477376]  [<ffffffff810d580f>] vfs_write+0xb2/0x153
[  322.477376]  [<ffffffff81063c09>] ? trace_hardirqs_on_caller+0x1f/0x14b
[  322.477376]  [<ffffffff810d5973>] sys_write+0x4a/0x71
[  322.477376]  [<ffffffff810021db>] system_call_fastpath+0x16/0x1b
[  322.477376] Code: 3b 56 10 73 1e 48 83 fa f2 74 18 48 8d 4d cc 4d 89 f8 48 89 df e8 77 f2 ff ff 41 01 c4 83 7d cc 00 74 19 4d 8b 6d 20 49 83 ed 20 <49> 8b 45 20 0f 18 08 49 8d 45 20 48 39 45 80 75 aa 4c 89 f7 e8 
[  322.477376] RIP  [<ffffffff810c0c87>] page_referenced+0xee/0x1dc
[  322.477376]  RSP <ffff88020936d8b8>
[  322.477376] CR2: 0000000000000000
[  322.491359] ---[ end trace 520a5274d8859b71 ]---
[  322.491509] note: hib.sh[2750] exited with preempt_count 2
[  322.491663] BUG: scheduling while atomic: hib.sh/2750/0x10000003
[  322.491810] INFO: lockdep is turned off.
[  322.491956] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 pcspkr serial_core k10temp ohci_hcd edac_core
[  322.493364] Pid: 2750, comm: hib.sh Tainted: G      D W  2.6.34-rc3-00411-ga7247b6 #13
[  322.493622] Call Trace:
[  322.493768]  [<ffffffff8106311f>] ? __debug_show_held_locks+0x1b/0x24
[  322.493919]  [<ffffffff8102d3d0>] __schedule_bug+0x72/0x77
[  322.494070]  [<ffffffff813f572e>] schedule+0xd9/0x730
[  322.494223]  [<ffffffff8103023c>] __cond_resched+0x18/0x24
[  322.494378]  [<ffffffff813f5e52>] _cond_resched+0x2c/0x37
[  322.494527]  [<ffffffff810b7da5>] unmap_vmas+0x6ce/0x893
[  322.494678]  [<ffffffff813f8e86>] ? _raw_spin_unlock_irqrestore+0x38/0x69
[  322.494829]  [<ffffffff810bc457>] exit_mmap+0xd7/0x182
[  322.494978]  [<ffffffff81035969>] mmput+0x48/0xb9
[  322.495131]  [<ffffffff81039c39>] exit_mm+0x110/0x11d
[  322.495280]  [<ffffffff8103b67b>] do_exit+0x1c5/0x691
[  322.495521]  [<ffffffff81038d25>] ? kmsg_dump+0x13b/0x155
[  322.495668]  [<ffffffff810060db>] ? oops_end+0x47/0x93
[  322.495816]  [<ffffffff81006122>] oops_end+0x8e/0x93
[  322.495964]  [<ffffffff8101ed95>] no_context+0x1fc/0x20b
[  322.496118]  [<ffffffff8101ef30>] __bad_area_nosemaphore+0x18c/0x1af
[  322.496267]  [<ffffffff8101f16b>] ? do_page_fault+0xa8/0x32d
[  322.496484]  [<ffffffff8101ef66>] bad_area_nosemaphore+0x13/0x15
[  322.496630]  [<ffffffff8101f236>] do_page_fault+0x173/0x32d
[  322.496780]  [<ffffffff813f96e3>] ? error_sti+0x5/0x6
[  322.496928]  [<ffffffff81062bc7>] ? trace_hardirqs_off_caller+0x1f/0xa9
[  322.497082]  [<ffffffff813f80d2>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[  322.497232]  [<ffffffff813f94ff>] page_fault+0x1f/0x30
[  322.497392]  [<ffffffff810c0c87>] ? page_referenced+0xee/0x1dc
[  322.497541]  [<ffffffff810c0c19>] ? page_referenced+0x80/0x1dc
[  322.497690]  [<ffffffff810c60bc>] ? swapcache_free+0x37/0x3c
[  322.497839]  [<ffffffff810ab7c2>] shrink_page_list+0x14a/0x477
[  322.497989]  [<ffffffff810abe46>] shrink_inactive_list+0x357/0x5e5
[  322.498141]  [<ffffffff810ab666>] ? shrink_active_list+0x232/0x244
[  322.498291]  [<ffffffff810ac3e0>] shrink_zone+0x30c/0x3d6
[  322.498444]  [<ffffffff810acfbb>] do_try_to_free_pages+0x176/0x27f
[  322.498594]  [<ffffffff810ad159>] shrink_all_memory+0x95/0xc4
[  322.498743]  [<ffffffff810aa65c>] ? isolate_pages_global+0x0/0x1f0
[  322.498892]  [<ffffffff81076e7c>] ? count_data_pages+0x65/0x79
[  322.499046]  [<ffffffff810770e3>] hibernate_preallocate_memory+0x1aa/0x2cb
[  322.499195]  [<ffffffff813f5325>] ? printk+0x41/0x44
[  322.499344]  [<ffffffff81075a83>] hibernation_snapshot+0x36/0x1e1
[  322.499498]  [<ffffffff81075cfc>] hibernate+0xce/0x172
[  322.499647]  [<ffffffff81074a69>] state_store+0x5c/0xd3
[  322.499795]  [<ffffffff81185043>] kobj_attr_store+0x17/0x19
[  322.499944]  [<ffffffff81125e87>] sysfs_write_file+0x108/0x144
[  322.500097]  [<ffffffff810d580f>] vfs_write+0xb2/0x153
[  322.500246]  [<ffffffff81063c09>] ? trace_hardirqs_on_caller+0x1f/0x14b
[  322.500399]  [<ffffffff810d5973>] sys_write+0x4a/0x71
[  322.500547]  [<ffffffff810021db>] system_call_fastpath+0x16/0x1b
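
[ A hedged reading of the register dump above, as a sketch only. The
  faulting instruction marked in the Code: line, <49> 8b 45 20, is a
  load from 0x20(%r13). R13 is ffffffffffffffe0, so the address is 0,
  which matches CR2. That is the value a list_for_each_entry() step
  produces when the "next" pointer it just followed was NULL and the
  list_head sits 0x20 bytes into the entry. The struct below is an
  assumption: it only mimics the layout of struct anon_vma_chain, with
  64-bit pointers modelled as uint64_t, and the helper names are made
  up. The trace itself does not say which list was being walked; the
  match with same_anon_vma is an inference from the offset. ]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout, pointers modelled as 64-bit integers (not kernel code). */
struct model_list_head {
	uint64_t next;
	uint64_t prev;
};

struct model_avc {
	uint64_t vma;
	uint64_t anon_vma;
	struct model_list_head same_vma;
	struct model_list_head same_anon_vma;
};

/* What container_of(next, struct model_avc, same_anon_vma) computes. */
uint64_t entry_from_next(uint64_t next)
{
	return next - offsetof(struct model_avc, same_anon_vma);
}

/* Address the loop reads afterwards: &entry->same_anon_vma.next. */
uint64_t next_load_address(uint64_t entry)
{
	return entry + offsetof(struct model_avc, same_anon_vma);
}
```

With a NULL next pointer, entry_from_next(0) wraps around to the R13
value in the dump, and next_load_address() of that is address 0 again.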

-- 
Regards/Gruss,
Boris.


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-09 20:43                                                                                         ` Johannes Weiner
  2010-04-09 20:57                                                                                           ` Rik van Riel
  2010-04-09 21:33                                                                                           ` Borislav Petkov
@ 2010-04-09 23:22                                                                                           ` Linus Torvalds
  2010-04-09 23:45                                                                                             ` Rik van Riel
                                                                                                               ` (2 more replies)
  2 siblings, 3 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-09 23:22 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Borislav Petkov, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson



On Fri, 9 Apr 2010, Johannes Weiner wrote:
> +	/*
> +	 * 1. case: vma and next are split parts of one root vma.
> +	 * Their anon_vma_chain is equal and we can drop that of next.
> +	 *
> +	 * 2. case: one vma was instantiated as mergeable with the
> +	 * other one and inherited the other one's primary anon_vma as
> +	 * the singleton in its chain.
> +	 *
> +	 * If next came after vma, vma's chain is already an unstrict
> +	 * superset of next's and we can treat it like case 1.
> +	 *
> +	 * If vma has the singleton chain, we have to copy next's
> +	 * unique anon_vmas over.
> +	 */

This comment makes my head hurt. In fact, the whole anon_vma thing hurts 
my head.

Can we have some better high-level documentation on what happens for all 
the cases?

 - split (mprotect, or munmap in the middle):

	anon_vma_clone: the two vma's will have the same anon_vma, and the 
	anon_vma chains will be equivalent. 

 - merge (mprotect that creates a mergeable state):

	anon_vma_merge: we're supposed to have an anon_vma_chain that is 
	a superset of the two chains of the merged entries.

 - fork:

	anon_vma_fork: each new vma will have a _new_ anon_vma as its 
	primary one, and will link to the old primary through the 
	anon_vma_chain. It's doing this with an anon_vma_clone() followed 
	by adding an extra entry to the new anon_vma, and setting 
	vma->anon_vma to the new one.

 - create/mmap:

	anon_vma_prepare: find a mergeable anon_vma and use that as a 
	singleton, because the other entries on the anon_vma chain won't 
	matter, since they cannot be associated with any pages associated 
	with the newly created vma.

Correct?

Quite frankly, just looking at that, I can't see how we get to your rules. 
At least not trivially. Especially with multiple merges, I don't see 
how "singleton" is such a special case.
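
[ A rough sketch of the four cases listed above, not kernel code. It
  models an anon_vma as a small integer id and a vma's anon_vma_chain
  as a bitmask of ids. Every name here (toy_vma, toy_split, ...) is
  made up for illustration, and toy_merge() only states the "superset"
  requirement; it is not an implementation of anon_vma_merge(). ]

```c
#include <assert.h>
#include <stdint.h>

/* Toy model, not kernel code: an anon_vma is an id, a chain is a bitmask. */
struct toy_vma {
	int primary;     /* stands in for vma->anon_vma */
	uint32_t chain;  /* set of anon_vma ids on vma->anon_vma_chain */
};

uint32_t toy_bit(int id)
{
	return (uint32_t)1 << id;
}

/* split: the new half gets the same primary and an equivalent chain. */
struct toy_vma toy_split(const struct toy_vma *vma)
{
	return *vma;
}

/* fork: clone the parent's chain, then add a brand-new primary to it. */
struct toy_vma toy_fork(const struct toy_vma *parent, int new_id)
{
	struct toy_vma child = { new_id, parent->chain | toy_bit(new_id) };

	return child;
}

/* create/mmap: reuse the neighbour's primary as a singleton chain. */
struct toy_vma toy_create(const struct toy_vma *near)
{
	struct toy_vma vma = { near->primary, toy_bit(near->primary) };

	return vma;
}

/* merge: the survivor must cover both chains (a superset of the two). */
struct toy_vma toy_merge(const struct toy_vma *vma, const struct toy_vma *next)
{
	struct toy_vma merged = { vma->primary, vma->chain | next->chain };

	assert(vma->primary == next->primary);
	return merged;
}
```

Forking a root vma (id 0) into a child with new id 1 gives the child
the chain {0, 1}. A vma created next to that child gets only {1}, and
merging the two has to end up with {0, 1} again.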

		Linus


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-09 23:22                                                                                           ` Linus Torvalds
@ 2010-04-09 23:45                                                                                             ` Rik van Riel
  2010-04-10  0:03                                                                                               ` Linus Torvalds
  2010-04-09 23:54                                                                                             ` Johannes Weiner
  2010-04-09 23:56                                                                                             ` Linus Torvalds
  2 siblings, 1 reply; 242+ messages in thread
From: Rik van Riel @ 2010-04-09 23:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, Borislav Petkov, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On 04/09/2010 07:22 PM, Linus Torvalds wrote:
>
>
> On Fri, 9 Apr 2010, Johannes Weiner wrote:
>> +	/*
>> +	 * 1. case: vma and next are split parts of one root vma.
>> +	 * Their anon_vma_chain is equal and we can drop that of next.
>> +	 *
>> +	 * 2. case: one vma was instantiated as mergeable with the
>> +	 * other one and inherited the other one's primary anon_vma as
>> +	 * the singleton in its chain.
>> +	 *
>> +	 * If next came after vma, vma's chain is already an unstrict
>> +	 * superset of next's and we can treat it like case 1.
>> +	 *
>> +	 * If vma has the singleton chain, we have to copy next's
>> +	 * unique anon_vmas over.
>> +	 */
>
> This comment makes my head hurt. In fact, the whole anon_vma thing hurts
> my head.
>
> Can we have some better high-level documentation on what happens for all
> the cases.
>
>   - split (mprotect, or munmap in the middle):
>
> 	anon_vma_clone: the two vma's will have the same anon_vma, and the
> 	anon_vma chains will be equivalent.
>
>   - merge (mprotect that creates a mergeable state):
>
> 	anon_vma_merge: we're supposed to have a anon_vma_chain that is
> 	a superset of the two chains of the merged entries.
>
>   - fork:
>
> 	anon_vma_fork: each new vma will have a _new_ anon_vma as it's
> 	primary one, and will link to the old primary trough the
> 	anon_vma_chain. It's doing this with a anon_vma_clone() followed
> 	by adding an entra entry to the new anon_vma, and setting
> 	vma->anon_vma to the new one.
>
>   - create/mmap:
>
> 	anon_vma_prepare: find a mergeable anon_vma and use that as a
> 	singleton, because the other entries on the anon_vma chain won't
> 	matter, since they cannot be associated with any pages associated
> 	with the newly created vma..
>
> Correct?

This is indeed correct.

> Quite frankly, just looking at that, I can't see how we get to your rules.
> At least not trivially. Especially with multiple merges, I don't see
> how "singleton" is such a special case.

The trick is in the fact that anon_vma_merge is only called
when vma->anon_vma == vma1->anon_vma.

If the top anon_vmas are different, then anon_vma_merge will
not be called.

This means that VMAs which have recently passed through fork
will not be passed to anon_vma_merge, because their top
anon_vmas are different.

That leaves just the split & create cases, which will be
passed to anon_vma_merge when they are merged.

In case of split, they will have identical anon_vma chains.

In case of create + merge, one of the two VMAs will have
the whole anon_vma chain, while the other one has just
the top anon_vma.
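
[ The case analysis above can be written down as a small classifier.
  This is a sketch under assumptions, not kernel code: a chain is
  modelled as a bitmask of anon_vma ids, and pair_vma, pair_kind and
  classify_pair() are hypothetical names. PAIR_UNEXPECTED marks the
  combination that the reasoning above says cannot occur. ]

```c
#include <assert.h>
#include <stdint.h>

/* Toy model, not kernel code: a chain is a bitmask of anon_vma ids. */
struct pair_vma {
	int primary;     /* stands in for vma->anon_vma */
	uint32_t chain;  /* set of anon_vma ids on the chain */
};

enum pair_kind {
	PAIR_NOT_MERGED,   /* different top anon_vma, e.g. after fork */
	PAIR_FROM_SPLIT,   /* identical chains */
	PAIR_FROM_CREATE,  /* one side holds only the top anon_vma */
	PAIR_UNEXPECTED    /* should not happen if the reasoning holds */
};

enum pair_kind classify_pair(const struct pair_vma *a, const struct pair_vma *b)
{
	uint32_t top = (uint32_t)1 << a->primary;

	if (a->primary != b->primary)
		return PAIR_NOT_MERGED;
	if (a->chain == b->chain)
		return PAIR_FROM_SPLIT;
	if (a->chain == top || b->chain == top)
		return PAIR_FROM_CREATE;
	return PAIR_UNEXPECTED;
}
```

Two vmas that share a top anon_vma but have different, non-singleton
chains land in PAIR_UNEXPECTED, which is the situation a correct merge
scheme must rule out.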


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-09 23:22                                                                                           ` Linus Torvalds
  2010-04-09 23:45                                                                                             ` Rik van Riel
@ 2010-04-09 23:54                                                                                             ` Johannes Weiner
  2010-04-09 23:56                                                                                             ` Linus Torvalds
  2 siblings, 0 replies; 242+ messages in thread
From: Johannes Weiner @ 2010-04-09 23:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On Fri, Apr 09, 2010 at 04:22:19PM -0700, Linus Torvalds wrote:
> 
> 
> On Fri, 9 Apr 2010, Johannes Weiner wrote:
> > +	/*
> > +	 * 1. case: vma and next are split parts of one root vma.
> > +	 * Their anon_vma_chain is equal and we can drop that of next.
> > +	 *
> > +	 * 2. case: one vma was instantiated as mergeable with the
> > +	 * other one and inherited the other one's primary anon_vma as
> > +	 * the singleton in its chain.
> > +	 *
> > +	 * If next came after vma, vma's chain is already an unstrict
> > +	 * superset of next's and we can treat it like case 1.
> > +	 *
> > +	 * If vma has the singleton chain, we have to copy next's
> > +	 * unique anon_vmas over.
> > +	 */
> 
> This comment makes my head hurt. In fact, the whole anon_vma thing hurts 
> my head.

I can relate ;)

> Can we have some better high-level documentation on what happens for all 
> the cases.
> 
>  - split (mprotect, or munmap in the middle):
> 
> 	anon_vma_clone: the two vma's will have the same anon_vma, and the 
> 	anon_vma chains will be equivalent. 
> 
>  - merge (mprotect that creates a mergeable state):
> 
> 	anon_vma_merge: we're supposed to have a anon_vma_chain that is 
> 	a superset of the two chains of the merged entries.
> 
>  - fork:
> 
> 	anon_vma_fork: each new vma will have a _new_ anon_vma as it's 
> 	primary one, and will link to the old primary trough the 
> 	anon_vma_chain. It's doing this with a anon_vma_clone() followed 
> 	by adding an entra entry to the new anon_vma, and setting 
> 	vma->anon_vma to the new one.
> 
>  - create/mmap:
> 
> 	anon_vma_prepare: find a mergeable anon_vma and use that as a 
> 	singleton, because the other entries on the anon_vma chain won't 
> 	matter, since they cannot be associated with any pages associated 
> 	with the newly created vma..
> 
> Correct?
> 
> Quite frankly, just looking at that, I can't see how we get to your rules. 
> At least not trivially. Especially with multiple merges, I don't see 
> how "singleton" is such a special case.

The key is that merging is only possible if the primary anon_vmas are
equivalent.

This only happens if we split a vma in two and clone the old vma's
anon_vma_chain into the new vma.  So the chains are equivalent.

Or anon_vma_prepare() finds a mergeable anon_vma, in which case this
will be the singleton on the vma's chain.

If a split vma is merged, the old anon_vma_chains are equivalent, we
drop one completely and the one that stays has not changed.

If a mergeable vma (singleton anon_vma) is merged into another one,
this singleton is the primary anon_vma of the swallowing vma, thus
already linked and the swallowing vma's anon_vma_chain stays unchanged.

If it's the other way round and the singleton vma swallows the other
one, every anon_vma of the vanishing vma is moved over (except your
singleton anon_vma, you already have that).  The result should look
exactly like the chain we swallowed.  So in all this merging, no
unique and new combination of anon_vma_chains should have been
created!  Thus you can merge as much as you want, either you swallow
singletons and don't change yourself or you are the singleton and
after the merger have an equivalent anon_vma_chain to the vma you
swallowed.

Again: no new anon_vmas should enter the game for mergeable vmas and
no _new_ anon_vma_chains should be created while merging.  Thus it
is always true that you either merge with a singleton or the chains
are equivalent.

At least those are my assumptions.  Maybe they are crap, but I don't
see how right now.
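
[ To make the rule concrete, here is a sketch of it in a toy model,
  not the patch itself: chains are bitmasks of anon_vma ids, and
  mrg_vma, mrg_is_singular() and mrg_merge() are hypothetical names.
  It only shows what the rule is meant to produce under the
  assumptions stated above; it says nothing about locking or about
  what the real lists look like when those assumptions do not hold. ]

```c
#include <assert.h>
#include <stdint.h>

/* Toy model, not kernel code: a chain is a bitmask of anon_vma ids. */
struct mrg_vma {
	int primary;     /* stands in for vma->anon_vma */
	uint32_t chain;  /* set of anon_vma ids on the chain */
};

/* True if exactly one id is set, like list_is_singular() on a chain. */
int mrg_is_singular(uint32_t chain)
{
	return chain != 0 && (chain & (chain - 1)) == 0;
}

/* Chain of the surviving vma after it swallows next, per the rule above. */
uint32_t mrg_merge(const struct mrg_vma *vma, const struct mrg_vma *next)
{
	uint32_t shared = (uint32_t)1 << vma->primary;

	assert(vma->primary == next->primary);
	if (!mrg_is_singular(vma->chain))
		return vma->chain;                   /* drop next's chain */
	return vma->chain | (next->chain & ~shared); /* move the rest over */
}
```

In the three cases described (split pair, singleton swallowed,
singleton swallowing), the result equals the union of the two chains,
which is what a merge is supposed to leave behind.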

And according to Boris' test, somewhere we still drop anon_vmas
while pages in the field are still pointing at them.

	Hannes


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-09 23:22                                                                                           ` Linus Torvalds
  2010-04-09 23:45                                                                                             ` Rik van Riel
  2010-04-09 23:54                                                                                             ` Johannes Weiner
@ 2010-04-09 23:56                                                                                             ` Linus Torvalds
  2010-04-10  0:19                                                                                               ` Rik van Riel
  2010-04-10  0:31                                                                                               ` Johannes Weiner
  2 siblings, 2 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-09 23:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Borislav Petkov, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson



On Fri, 9 Apr 2010, Linus Torvalds wrote:
> 
> Can we have some better high-level documentation on what happens for all 
> the cases.
> 
>  - split (mprotect, or munmap in the middle):
> 
> 	anon_vma_clone: the two vma's will have the same anon_vma, and the 
> 	anon_vma chains will be equivalent. 
> 
>  - merge (mprotect that creates a mergeable state):
> 
> 	anon_vma_merge: we're supposed to have a anon_vma_chain that is 
> 	a superset of the two chains of the merged entries.
> 
>  - fork:
> 
> 	anon_vma_fork: each new vma will have a _new_ anon_vma as it's 
> 	primary one, and will link to the old primary trough the 
> 	anon_vma_chain. It's doing this with a anon_vma_clone() followed 
> 	by adding an entra entry to the new anon_vma, and setting 
> 	vma->anon_vma to the new one.
> 
>  - create/mmap:
> 
> 	anon_vma_prepare: find a mergeable anon_vma and use that as a 
> 	singleton, because the other entries on the anon_vma chain won't 
> 	matter, since they cannot be associated with any pages associated 
> 	with the newly created vma..
> 
> Correct?

Ok, so I don't know if the above is correct, but if it is, let's ignore 
the "merge" case as being complex, and look at the other cases.

With fork, the main anon_vma becomes different, so let's ignore that. That 
always means that the resulting list is not comparable or compatible, and 
we'll never mix them up.

If we make one very _simple_ rule for the create/mmap case, namely that we 
only re-use another _singleton_ anon_vma, then the split and create cases 
will look exactly the same. And in particular, we get a very simple and 
powerful rule: if the anon_vma matches, then the _list_ will also always 
match.

And that, in turn, would make 'merge' trivial too: you really can always 
drop the side that goes away. There's never any question about how to 
merge the lists, or which to pick, because every single operation that 
leaves the anon_vma the same will guarantee that the list will be 
identical too.

So now the simple rule is that if the anon_vma is the same, then the list 
of associated anon_vma's will always be the same - across all of merge, 
split and create.

Isn't that a _much_ simpler model to think about?
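
[ As a sketch of that rule in a toy model, not kernel code: chains are
  bitmasks of anon_vma ids, and near_vma, near_is_singular() and
  reusable_anon_vma() are hypothetical names. The adjacency, policy
  and flag checks that the real lookup also performs are left out
  here; only the singleton condition is modelled. ]

```c
#include <assert.h>
#include <stdint.h>

/* Toy model, not kernel code: a chain is a bitmask of anon_vma ids. */
struct near_vma {
	int primary;     /* stands in for near->anon_vma, < 0 if none */
	uint32_t chain;  /* set of anon_vma ids on the chain */
};

/* True if exactly one id is set, like list_is_singular() on a chain. */
int near_is_singular(uint32_t chain)
{
	return chain != 0 && (chain & (chain - 1)) == 0;
}

/* Returns the anon_vma id to reuse, or -1 for "allocate a fresh one". */
int reusable_anon_vma(const struct near_vma *near)
{
	if (near->primary < 0)
		return -1;
	if (!near_is_singular(near->chain))
		return -1;   /* the extra rule: only reuse a singleton */
	return near->primary;
}
```

A new vma that reuses id N this way gets the chain {N}, which is
exactly the neighbour's chain, so "same anon_vma" implies "same list"
right after creation.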

So _instead_ of all the patches that have floated about, I would suggest 
this simple change to "find_mergeable_anon_vma()" instead..

Oh, and maybe it's the meds talking again. I'm feeling better than 
yesterday, but am still a bit lightheaded. 

		Linus

---
 mm/mmap.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 75557c6..462a8ca 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -850,7 +850,8 @@ struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
 	vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
 	vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC);
 
-	if (near->anon_vma && vma->vm_end == near->vm_start &&
+	if (near->anon_vma && list_is_singular(&near->anon_vma_chain) &&
+			vma->vm_end == near->vm_start &&
  			mpol_equal(vma_policy(vma), vma_policy(near)) &&
 			can_vma_merge_before(near, vm_flags,
 				NULL, vma->vm_file, vma->vm_pgoff +
@@ -871,7 +872,8 @@ try_prev:
 	vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
 	vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC);
 
-	if (near->anon_vma && near->vm_end == vma->vm_start &&
+	if (near->anon_vma && list_is_singular(&near->anon_vma_chain) &&
+			near->vm_end == vma->vm_start &&
   			mpol_equal(vma_policy(near), vma_policy(vma)) &&
 			can_vma_merge_after(near, vm_flags,
 				NULL, vma->vm_file, vma->vm_pgoff))


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-09 23:45                                                                                             ` Rik van Riel
@ 2010-04-10  0:03                                                                                               ` Linus Torvalds
  2010-04-10  0:11                                                                                                 ` Rik van Riel
  0 siblings, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-10  0:03 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Johannes Weiner, Borislav Petkov, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson



On Fri, 9 Apr 2010, Rik van Riel wrote:
> 
> The trick is in the fact that anon_vma_merge is only called
> when vma->anon_vma == vma1->anon_vma.

Sure sure. I still think it's _way_ too complex. See my previous email 
where I suggested one single simple additional rule that I think makes 
things _much_ simpler.

> If the top anon_vmas are different, then anon_vma_merge will
> not be called.

Right. The case of different anon_vma's is the trivial one. I don't worry 
about that.

> That leaves just the split & create cases, which will be
> passed to anon_vma_merge when they are merged.
> 
> In case of split, they will have identical anon_vma chains.

And yes, split is fundamentally simple. Split guarantees that the chains 
look identical.

But:

> In case of create + merge, one of the two VMAs will have
> the whole anon_vma chain, while the other one has just
> the top anon_vma.

THIS is where I think you simplified a lot and said "and magic happens".

The thing is, in the case of create, we create a different chain. That 
simple fact just makes merging fundamentally complicated.  And we now have 
two different chains, and both of those can split, so those differences 
can "spread out". And you need to guarantee that "merge" really works. It 
didn't work in your original code, and quite frankly, I do _not_ think 
it's entirely obvious that it works in Johannes' code either.

Don't get me wrong: _maybe_ Johannes' code works fine. I just don't think 
it's obvious at all. And if it doesn't work fine, now you're just 
spreading the differences even further.

This is why I suggest that we limit the "re-use an existing vma for a new 
case" to the singleton case, which means that now you _never_ have 
differences at all. There's no spreading on splitting. Merging is trivial. 

Now, admittedly, I'm really hopped up on cough medication, so the feeling 
of this solving all the problems in the universe may not be entirely 
accurate. But it feels so _right_.

I hope it feels right when I'm off my meds too.

			Linus


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10  0:03                                                                                               ` Linus Torvalds
@ 2010-04-10  0:11                                                                                                 ` Rik van Riel
  0 siblings, 0 replies; 242+ messages in thread
From: Rik van Riel @ 2010-04-10  0:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, Borislav Petkov, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On 04/09/2010 08:03 PM, Linus Torvalds wrote:

> This is why I suggest that we limit the "re-use an existing vma for a new
> case" to the singleton case, which means that now you _never_ have
> differences at all. There's no spreading on splitting. Merging is trivial.

That looks like it should work.

> Now, admittedly, I'm really hopped up on cough medication, so the feeling
> of this solving all the problems in the universe may not be entirely
> accurate. But it feels so _right_.
>
> I hope it feels right when I'm off my meds too.

I am not on any cough meds, and your patch looks right.
OTOH, maybe I should be on some kind of cold meds, because
I haven't been feeling right all week...


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-09 23:56                                                                                             ` Linus Torvalds
@ 2010-04-10  0:19                                                                                               ` Rik van Riel
  2010-04-10  0:31                                                                                               ` Johannes Weiner
  1 sibling, 0 replies; 242+ messages in thread
From: Rik van Riel @ 2010-04-10  0:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, Borislav Petkov, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On 04/09/2010 07:56 PM, Linus Torvalds wrote:

> So _instead_ of all the patches that have floated about, I would suggest
> this simple change to "find_mergeable_anon_vma()" instead..

Boris, this is your chance to really ruin our week :)

If the bug persists with Linus's patch, we've been fixing
the wrong bug all week long, and you are experiencing
something else...

I'm getting really curious now.

> ---
>   mm/mmap.c |    6 ++++--
>   1 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 75557c6..462a8ca 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -850,7 +850,8 @@ struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
>   	vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
>   	vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC);
>
> -	if (near->anon_vma && vma->vm_end == near->vm_start &&
> +	if (near->anon_vma && list_is_singular(&near->anon_vma_chain) &&
> +			vma->vm_end == near->vm_start &&
>    			mpol_equal(vma_policy(vma), vma_policy(near)) &&
>   			can_vma_merge_before(near, vm_flags,
>   				NULL, vma->vm_file, vma->vm_pgoff +
> @@ -871,7 +872,8 @@ try_prev:
>   	vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
>   	vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC);
>
> -	if (near->anon_vma && near->vm_end == vma->vm_start &&
> +	if (near->anon_vma && list_is_singular(&near->anon_vma_chain) &&
> +			near->vm_end == vma->vm_start &&
>     			mpol_equal(vma_policy(near), vma_policy(vma)) &&
>   			can_vma_merge_after(near, vm_flags,
>   				NULL, vma->vm_file, vma->vm_pgoff))



* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-09 23:56                                                                                             ` Linus Torvalds
  2010-04-10  0:19                                                                                               ` Rik van Riel
@ 2010-04-10  0:31                                                                                               ` Johannes Weiner
  2010-04-10  0:32                                                                                                 ` Linus Torvalds
  1 sibling, 1 reply; 242+ messages in thread
From: Johannes Weiner @ 2010-04-10  0:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On Fri, Apr 09, 2010 at 04:56:13PM -0700, Linus Torvalds wrote:
> So _instead_ of all the patches that have floated about, I would suggest 
> this simple change to "find_mergeable_anon_vma()" instead..

That leaves the chance that my code was correct and we leave a conceptual
error around somewhere that can materialize again.  But I am at a point
where simplification never sounded more blissful, so yeah, I like it :)

Let's hope it fixes Boris's issue.


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10  0:31                                                                                               ` Johannes Weiner
@ 2010-04-10  0:32                                                                                                 ` Linus Torvalds
  2010-04-10  7:27                                                                                                   ` Borislav Petkov
  0 siblings, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-10  0:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Borislav Petkov, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson



On Sat, 10 Apr 2010, Johannes Weiner wrote:
> 
> That leaves the chance that my code was correct and we leave a conceptual
> error around somewhere that can materialize again.

Absolutely. I really don't know whether your merge routine works or not. 
I'd just rather not have to even _try_ to understand it.

I have a fairly simple rule for most of the code I see: if I have a hard 
time understanding why it should work, I don't really want to rely on it.

> But I am at a point where simplification never sounded more blissful, so 
> yeah, I like it :)

Exactly. This is the "let's limit things a bit to keep them much simpler" rule.

> Let's hope it fixes Boris's issue.

I'm going to just guess that it won't, and that Boris' issue was actually 
due to something else entirely, and we've all been staring at totally the 
wrong code.

But we can hope.

			Linus


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10  0:32                                                                                                 ` Linus Torvalds
@ 2010-04-10  7:27                                                                                                   ` Borislav Petkov
  2010-04-10 11:26                                                                                                     ` Borislav Petkov
  0 siblings, 1 reply; 242+ messages in thread
From: Borislav Petkov @ 2010-04-10  7:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Fri, Apr 09, 2010 at 05:32:36PM -0700

> Exactly. This is the "let's limit things a bit to keep them much simpler" rule.

You gotta love that rule :)

> > Let's hope it fixes Boris's issue.
> 
> I'm going to just guess that it won't, and that Boris' issue was actually 
> due to something else entirely, and we've all been staring at totally the 
> wrong code.
> 
> But we can hope.

Now why would you go and jinx it like that... :)

Hibernation runs back-to-back:

1. light system load after boot... ok
2. 3 kvm guests, 3Gb mem free of 8Gb total acc. to /proc/meminfo... ok			[ this was the fireproof way to trigger the bug, btw]
3. kvm guests down, firefox loading a 4Mb html page... ok
4. start ubuntu guest, firefox keeps loading the 4Mb html page after previous resume... ok
5. ubuntu guest booting done, firefox done, play video... ok
6. video broken after resume due to:

[AO_ALSA] Pcm in suspend mode, trying to resume. 212%  2%  1.7% 1 0 
[AO_ALSA] alsa-lib: pcm_hw.c:709:(snd_pcm_hw_resume) SNDRV_PCM_IOCTL_RESUME failed: Function not implemented

i.e., unrelated... still ok

7. ubuntu guest downloading a 100Mb file causing allocation of a bunch of anon memory in the host... ok
8. all guests off, firefox off, back to light load... ok

No oopsies or problems in dmesg except the old lockdep sysfs warning.

I will keep running that kernel in the next couple of days and keep you
informed in case this is the fix we're gonna use.

-- 
Regards/Gruss,
    Boris.


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10  7:27                                                                                                   ` Borislav Petkov
@ 2010-04-10 11:26                                                                                                     ` Borislav Petkov
  2010-04-10 14:45                                                                                                       ` Rik van Riel
  2010-04-10 15:24                                                                                                       ` Linus Torvalds
  0 siblings, 2 replies; 242+ messages in thread
From: Borislav Petkov @ 2010-04-10 11:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

From: Borislav Petkov <bp@alien8.de>
Date: Sat, Apr 10, 2010 at 09:27:14AM +0200

> Now why would you go and jinx it like that... :)
> 
> Hibernation runs back-to-back:
> 
> 1. light system load after boot... ok
> 2. 3 kvm guests, 3Gb mem free of 8Gb total acc. to /proc/meminfo... ok			[ this was the fireproof way to trigger the bug, btw]
> 3. kvm guests down, firefox loading a 4Mb html page... ok
> 4. start ubuntu guest, firefox keeps loading the 4Mb html page after previous resume... ok
> 5. ubuntu guest booting done, firefox done, play video... ok
> 6. video broken after resume due to:
> 
> [AO_ALSA] Pcm in suspend mode, trying to resume. 212%  2%  1.7% 1 0 
> [AO_ALSA] alsa-lib: pcm_hw.c:709:(snd_pcm_hw_resume) SNDRV_PCM_IOCTL_RESUME failed: Function not implemented
> 
> i.e., unrelated... still ok
> 
> 7. ubuntu guest downloading a 100Mb file causing allocation of a bunch of anon memory in the host... ok
> 8. all guests off, firefox off, back to light load... ok
> 
> No oopsies or problems in dmesg except the old lockdep sysfs warning.
> 
> I will keep running that kernel in the next couple of days and keep you
> informed in case this is the fix we're gonna use.

Yep, you jinxed it :)

This time we got stuck on the anon_vma->lock (yep, we've seen that
oopsie before). So, it might be that we _really_ are staring at the
wrong code... Back to square one.


[18969.797126] BUG: soft lockup - CPU#1 stuck for 61s! [hib.sh:5605]
[18969.797126] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 ohci_hcd pcspkr serial_core k10temp edac_core
[18969.798029] irq event stamp: 0
[18969.798029] hardirqs last  enabled at (0): [<(null)>] (null)
[18969.798029] hardirqs last disabled at (0): [<ffffffff8103657c>] copy_process+0x3c1/0x10cc
[18969.798029] softirqs last  enabled at (0): [<ffffffff8103657c>] copy_process+0x3c1/0x10cc
[18969.798029] softirqs last disabled at (0): [<(null)>] (null)
[18969.798029] CPU 1 
[18969.798029] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 ohci_hcd pcspkr serial_core k10temp edac_core
[18969.798029] 
[18969.798029] Pid: 5605, comm: hib.sh Not tainted 2.6.34-rc3-00501-gefb57c0 #1 M3A78 PRO/System Product Name
[18969.798029] RIP: 0010:[<ffffffff8118b7f4>]  [<ffffffff8118b7f4>] delay_tsc+0x33/0xca
[18969.798029] RSP: 0018:ffff8801aebdf7b8  EFLAGS: 00000206
[18969.798029] RAX: 00000000fc6fc9e8 RBX: ffff8801aebdf7e8 RCX: 0000000000001200
[18969.798029] RDX: 0000000000002806 RSI: ffff8801aebdf848 RDI: 0000000000000001
[18969.798029] RBP: ffffffff81002b4e R08: 0000000000000001 R09: 0000000000000000
[18969.798029] R10: ffff8801aebdf8a8 R11: 0000000000000001 R12: 0000000000000014
[18969.798029] R13: ffff88000a200000 R14: ffff8801aebde000 R15: ffff8801aebdffd8
[18969.798029] FS:  00007f2c86c656f0(0000) GS:ffff88000a200000(0000) knlGS:0000000000000000
[18969.798029] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[18969.798029] CR2: 00007fd515101870 CR3: 000000022bd9a000 CR4: 00000000000006e0
[18969.798029] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[18969.798029] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[18969.798029] Process hib.sh (pid: 5605, threadinfo ffff8801aebde000, task ffff88022e194b80)
[18969.798029] Stack:
[18969.798029]  0000000000000001 ffff88022d2db720 ffff88022e194b80 00000000b3477260
[18969.798029] <0> ffff88022e194f28 000000002a5200c6 ffff8801aebdf7f8 ffffffff8118b7bf
[18969.798029] <0> ffff8801aebdf848 ffffffff8119a296 ffff88022d2db738 0000000000000001
[18969.798029] Call Trace:
[18969.798029]  [<ffffffff8118b7bf>] ? __delay+0xf/0x11
[18969.798029]  [<ffffffff8119a296>] ? do_raw_spin_lock+0xd2/0x13c
[18969.798029]  [<ffffffff813f843b>] ? _raw_spin_lock+0x60/0x73
[18969.798029]  [<ffffffff810c0ae3>] ? page_lock_anon_vma+0x63/0xac
[18969.798029]  [<ffffffff810c0ae3>] ? page_lock_anon_vma+0x63/0xac
[18969.798029]  [<ffffffff810c0a80>] ? page_lock_anon_vma+0x0/0xac
[18969.798029]  [<ffffffff810c0cc9>] ? page_referenced+0x80/0x1dc
[18969.798029]  [<ffffffff810c60a0>] ? swapcache_free+0x37/0x3c
[18969.798029]  [<ffffffff810ab7e6>] ? shrink_page_list+0x14a/0x477
[18969.798029]  [<ffffffff810abe6a>] ? shrink_inactive_list+0x357/0x5e5
[18969.798029]  [<ffffffff810ab68a>] ? shrink_active_list+0x232/0x244
[18969.798029]  [<ffffffff810ac404>] ? shrink_zone+0x30c/0x3d6
[18969.798029]  [<ffffffff810acfdf>] ? do_try_to_free_pages+0x176/0x27f
[18969.798029]  [<ffffffff810ad17d>] ? shrink_all_memory+0x95/0xc4
[18969.798029]  [<ffffffff810aa680>] ? isolate_pages_global+0x0/0x1f0
[18969.798029]  [<ffffffff81076e80>] ? count_data_pages+0x65/0x79
[18969.798029]  [<ffffffff810770e7>] ? hibernate_preallocate_memory+0x1aa/0x2cb
[18969.798029]  [<ffffffff813f5445>] ? printk+0x41/0x44
[18969.798029]  [<ffffffff81075a87>] ? hibernation_snapshot+0x36/0x1e1
[18969.798029]  [<ffffffff81075d00>] ? hibernate+0xce/0x172
[18969.798029]  [<ffffffff81074a6d>] ? state_store+0x5c/0xd3
[18969.798029]  [<ffffffff8118504b>] ? kobj_attr_store+0x17/0x19
[18969.798029]  [<ffffffff81125e8b>] ? sysfs_write_file+0x108/0x144
[18969.798029]  [<ffffffff810d5807>] ? vfs_write+0xb2/0x153
[18969.798029]  [<ffffffff81063c0d>] ? trace_hardirqs_on_caller+0x1f/0x14b
[18969.798029]  [<ffffffff810d596b>] ? sys_write+0x4a/0x71
[18969.798029]  [<ffffffff810021db>] ? system_call_fastpath+0x16/0x1b
[18969.798029] Code: 41 55 41 54 53 48 83 ec 08 0f 1f 44 00 00 49 89 fc bf 01 00 00 00 e8 88 1d ea ff e8 db f4 00 00 41 89 c5 0f ae f0 66 66 90 0f 31 <89> c3 65 4c 8b 34 25 48 b5 00 00 0f ae f0 66 66 90 0f 31 41 89 
[18969.798029] Call Trace:
[18969.798029]  [<ffffffff8118b7bf>] ? __delay+0xf/0x11
[18969.798029]  [<ffffffff8119a296>] ? do_raw_spin_lock+0xd2/0x13c
[18969.798029]  [<ffffffff813f843b>] ? _raw_spin_lock+0x60/0x73
[18969.798029]  [<ffffffff810c0ae3>] ? page_lock_anon_vma+0x63/0xac
[18969.798029]  [<ffffffff810c0ae3>] ? page_lock_anon_vma+0x63/0xac
[18969.798029]  [<ffffffff810c0a80>] ? page_lock_anon_vma+0x0/0xac
[18969.798029]  [<ffffffff810c0cc9>] ? page_referenced+0x80/0x1dc
[18969.798029]  [<ffffffff810c60a0>] ? swapcache_free+0x37/0x3c
[18969.798029]  [<ffffffff810ab7e6>] ? shrink_page_list+0x14a/0x477
[18969.798029]  [<ffffffff810abe6a>] ? shrink_inactive_list+0x357/0x5e5
[18969.798029]  [<ffffffff810ab68a>] ? shrink_active_list+0x232/0x244
[18969.798029]  [<ffffffff810ac404>] ? shrink_zone+0x30c/0x3d6
[18969.798029]  [<ffffffff810acfdf>] ? do_try_to_free_pages+0x176/0x27f
[18969.798029]  [<ffffffff810ad17d>] ? shrink_all_memory+0x95/0xc4
[18969.798029]  [<ffffffff810aa680>] ? isolate_pages_global+0x0/0x1f0
[18969.798029]  [<ffffffff81076e80>] ? count_data_pages+0x65/0x79
[18969.798029]  [<ffffffff810770e7>] ? hibernate_preallocate_memory+0x1aa/0x2cb
[18969.798029]  [<ffffffff813f5445>] ? printk+0x41/0x44
[18969.798029]  [<ffffffff81075a87>] ? hibernation_snapshot+0x36/0x1e1
[18969.798029]  [<ffffffff81075d00>] ? hibernate+0xce/0x172
[18969.798029]  [<ffffffff81074a6d>] ? state_store+0x5c/0xd3
[18969.798029]  [<ffffffff8118504b>] ? kobj_attr_store+0x17/0x19
[18969.798029]  [<ffffffff81125e8b>] ? sysfs_write_file+0x108/0x144
[18969.798029]  [<ffffffff810d5807>] ? vfs_write+0xb2/0x153
[18969.798029]  [<ffffffff81063c0d>] ? trace_hardirqs_on_caller+0x1f/0x14b
[18969.798029]  [<ffffffff810d596b>] ? sys_write+0x4a/0x71
[18969.798029]  [<ffffffff810021db>] ? system_call_fastpath+0x16/0x1b
[19005.426655] SysRq : HELP : loglevel(0-9) reBoot Crash show-all-locks(D) terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) thaw-filesystems(J) saK show-backtrace-all-active-cpus(L) show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z) 
[19005.663484] SysRq : HELP : loglevel(0-9) reBoot Crash show-all-locks(D) terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) thaw-filesystems(J) saK show-backtrace-all-active-cpus(L) show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z) 
[19007.018563] SysRq : Emergency Sync
[19007.018969] Emergency Sync complete
[19007.582218] SysRq : Emergency Remount R/O
[19008.251934] SysRq : Power Off
[19010.076146] SysRq : Resetting


-- 
Regards/Gruss,
Boris.


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 11:26                                                                                                     ` Borislav Petkov
@ 2010-04-10 14:45                                                                                                       ` Rik van Riel
  2010-04-10 15:24                                                                                                       ` Linus Torvalds
  1 sibling, 0 replies; 242+ messages in thread
From: Rik van Riel @ 2010-04-10 14:45 UTC (permalink / raw)
  To: Borislav Petkov, Linus Torvalds, Johannes Weiner,
	KOSAKI Motohiro, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson

On 04/10/2010 07:26 AM, Borislav Petkov wrote:

> This time we got stuck on the anon_vma->lock (yep, we've seen that
> oopsie before). So, it might be that we _really_ are staring at the
> wrong code... Back to square one.

This is a different bug, though.

If the null pointer dereference is gone, Linus's patch
fixed that bug and we can move forward to fixing the
anon_vma->lock bug.

I'll start auditing the code to see if we forget to
unlock the anon_vma in some unlikely error path...


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 11:26                                                                                                     ` Borislav Petkov
  2010-04-10 14:45                                                                                                       ` Rik van Riel
@ 2010-04-10 15:24                                                                                                       ` Linus Torvalds
  2010-04-10 16:38                                                                                                         ` Borislav Petkov
  2010-04-10 16:41                                                                                                         ` Linus Torvalds
  1 sibling, 2 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-10 15:24 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson



On Sat, 10 Apr 2010, Borislav Petkov wrote:
> > 
> > I will keep running that kernel in the next couple of days and keep you
> > informed in case this is the fix we're gonna use.
> 
> Yep, you jinxed it :)
> 
> This time we got stuck on the anon_vma->lock (yep, we've seen that
> oopsie before). So, it might be that we _really_ are staring at the
> wrong code... Back to square one.

No, I think we're good. I suspect this is a different issue. Do you have 
lockdep enabled, along with mutex and spinlock debugging etc? That might 
help pinpoint what triggers this.

But I think the fact that you are apparently not able to get the list 
corruption is a good sign. Of course, it might just be harder to trigger, 
and these things could all be a sign of a different bug, but my gut feel 
is that we did fix something, and you are just damn good at stressing the 
new code. Kudos.

		Linus


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 15:24                                                                                                       ` Linus Torvalds
@ 2010-04-10 16:38                                                                                                         ` Borislav Petkov
  2010-04-10 17:05                                                                                                           ` Linus Torvalds
  2010-04-10 17:07                                                                                                           ` Borislav Petkov
  2010-04-10 16:41                                                                                                         ` Linus Torvalds
  1 sibling, 2 replies; 242+ messages in thread
From: Borislav Petkov @ 2010-04-10 16:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sat, Apr 10, 2010 at 08:24:02AM -0700

> No, I think we're good. I suspect this is a different issue. Do you have 
> lockdep enabled, along with mutex and spinlock debugging etc? That might 
> help pinpoint what triggers this.

I had pretty much all lock debugging options enabled except PROVE_RCU.

> But I think the fact that you are apparently not able to get the list 
> corruption is a good sign. Of course, it might just be harder to trigger, 
> and these things could all be a sign of a different bug, but my gut feel 
> is that we did fix something, and you are just damn good at stressing the 
> new code. Kudos.

Yep, even my mom says I'm good at breaking things :) But seriously,
thanks - means a lot coming from you.

And I got an oops again, this time the #GP from a couple of days ago.

<thinking out loud>

I'm starting to think that maybe there could be something wrong with the
machine I'm running it on. Especially since there are only two people
who reported this issue, Steinar and me, so how probable is it that
maybe those two machines have a failing RAM module somewhere? Or some
other data corrupting thing? Although I should be getting mchecks...
Hmm...

</thinking out loud>

I'm going to run the stress test on 2.6.33.2 to verify whether this is
actually software-related. Just in case. Oh, yes, I almost forgot, the
latest and greatest in the world of oopsies:


[  452.351588] general protection fault: 0000 [#1] PREEMPT SMP 
[  452.352119] last sysfs file: /sys/power/state
[  452.352131] CPU 1 
[  452.352131] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 edac_core serial_core ohci_hcd pcspkr k10temp
[  452.352131] 
[  452.352131] Pid: 2929, comm: hib.sh Not tainted 2.6.34-rc3-00501-gefb57c0 #4 M3A78 PRO/System Product Name
[  452.352131] RIP: 0010:[<ffffffff810c5f00>]  [<ffffffff810c5f00>] page_referenced+0xee/0x1dc
[  452.352131] RSP: 0018:ffff88022adb18b8  EFLAGS: 00010206
[  452.352131] RAX: ffff88022ad5c468 RBX: ffffea0007598558 RCX: 0000000000000000
[  452.352131] RDX: ffff88022adb1cf8 RSI: ffff88022ad5c440 RDI: ffff88022e7d38a0
[  452.352131] RBP: ffff88022adb1938 R08: 0000000000000002 R09: 0000000000000000
[  452.352131] R10: ffff88022be83868 R11: ffffffff00000012 R12: 0000000000000000
[  452.352131] R13: 0032323200323212 R14: ffff88022ad5c428 R15: ffff88022adb1a00
[  452.352131] FS:  00007f056a1e36f0(0000) GS:ffff88000a200000(0000) knlGS:0000000000000000
[  452.352131] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  452.352131] CR2: 000000000250e408 CR3: 000000022983f000 CR4: 00000000000006e0
[  452.352131] DR0: 00000000000000a0 DR1: 0000000000000000 DR2: 0000000000000003
[  452.352131] DR3: 00000000000000b0 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  452.352131] Process hib.sh (pid: 2929, threadinfo ffff88022adb0000, task ffff88022e7d38a0)
[  452.352131] Stack:
[  452.352131]  ffff88022ad5c468 00000000810c5c1f ffff88022adb1918 ffffffff810c5d88
[  452.352131] <0> ffff88022adb18f8 ffffffff00000001 ffffea00075c89c0 ffffea00075984e8
[  452.352131] <0> ffffea00075984e8 000000022adb1cf8 ffffea00075984e8 ffffea0007598580
[  452.352131] Call Trace:
[  452.352131]  [<ffffffff810c5d88>] ? try_to_unmap_anon+0xa2/0xb4
[  452.352131]  [<ffffffff810b06bc>] shrink_page_list+0x154/0x4c7
[  452.352131]  [<ffffffff81067149>] ? print_lock_contention_bug+0x1b/0xe1
[  452.352131]  [<ffffffff810af59c>] ? isolate_pages_global+0xd0/0x1fc
[  452.352131]  [<ffffffff8140f856>] ? _raw_spin_unlock_irq+0x30/0x58
[  452.352131]  [<ffffffff810b0d8a>] shrink_inactive_list+0x35b/0x60c
[  452.352131]  [<ffffffff810b1347>] shrink_zone+0x30c/0x3d6
[  452.352131]  [<ffffffff810b1f3d>] do_try_to_free_pages+0x191/0x29a
[  452.352131]  [<ffffffff810b20db>] shrink_all_memory+0x95/0xc4
[  452.352131]  [<ffffffff810af4cc>] ? isolate_pages_global+0x0/0x1fc
[  452.352131]  [<ffffffff81079c9c>] ? count_data_pages+0x65/0x79
[  452.352131]  [<ffffffff81079f03>] hibernate_preallocate_memory+0x1aa/0x2cb
[  452.352131]  [<ffffffff8140bbd4>] ? printk+0x41/0x45
[  452.352131]  [<ffffffff8107878f>] hibernation_snapshot+0x36/0x1e1
[  452.352131]  [<ffffffff81078a08>] hibernate+0xce/0x172
[  452.352131]  [<ffffffff81077775>] state_store+0x5c/0xd3
[  452.352131]  [<ffffffff8118f3cf>] kobj_attr_store+0x17/0x19
[  452.352131]  [<ffffffff8112e288>] sysfs_write_file+0x108/0x144
[  452.352131]  [<ffffffff810db4ff>] vfs_write+0xb2/0x153
[  452.352131]  [<ffffffff810663c9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[  452.352131]  [<ffffffff810db663>] sys_write+0x4a/0x71
[  452.352131]  [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b
[  452.352131] Code: 3b 56 10 73 1e 48 83 fa f2 74 18 48 8d 4d cc 4d 89 f8 48 89 df e8 11 f2 ff ff 41 01 c4 83 7d cc 00 74 19 4d 8b 6d 20 49 83 ed 20 <49> 8b 45 20 0f 18 08 49 8d 45 20 48 39 45 80 75 aa 4c 89 f7 e8 
[  452.352131] RIP  [<ffffffff810c5f00>] page_referenced+0xee/0x1dc
[  452.352131]  RSP <ffff88022adb18b8>
[  452.368192] ---[ end trace a9c84cb81ab9fd41 ]---
[  452.368372] note: hib.sh[2929] exited with preempt_count 2
[  452.368564] BUG: scheduling while atomic: hib.sh/2929/0x10000003
[  452.368742] INFO: lockdep is turned off.
[  452.368915] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 edac_core serial_core ohci_hcd pcspkr k10temp
[  452.370749] Pid: 2929, comm: hib.sh Tainted: G      D    2.6.34-rc3-00501-gefb57c0 #4
[  452.371051] Call Trace:
[  452.371239]  [<ffffffff810658df>] ? __debug_show_held_locks+0x1b/0x24
[  452.371425]  [<ffffffff8102dfac>] __schedule_bug+0x72/0x77
[  452.371608]  [<ffffffff8140bfe8>] schedule+0xe3/0x7ff
[  452.371788]  [<ffffffff810bd066>] ? unmap_vmas+0x88e/0x893
[  452.371973]  [<ffffffff81030ecb>] __cond_resched+0x18/0x24
[  452.372168]  [<ffffffff8140c7d1>] _cond_resched+0x2c/0x37
[  452.372348]  [<ffffffff810bcea6>] unmap_vmas+0x6ce/0x893
[  452.372531]  [<ffffffff8140f8b6>] ? _raw_spin_unlock_irqrestore+0x38/0x69
[  452.372721]  [<ffffffff810c1604>] exit_mmap+0xd7/0x182
[  452.372903]  [<ffffffff810368bc>] mmput+0x48/0xb9
[  452.373088]  [<ffffffff8103ad90>] exit_mm+0x110/0x11d
[  452.373284]  [<ffffffff8103c9e6>] do_exit+0x1c5/0x6e5
[  452.373464]  [<ffffffff81039e2f>] ? kmsg_dump+0x13b/0x155
[  452.373645]  [<ffffffff8100616b>] ? oops_end+0x47/0x93
[  452.373826]  [<ffffffff810061b2>] oops_end+0x8e/0x93
[  452.374006]  [<ffffffff810063a3>] die+0x5a/0x63
[  452.374198]  [<ffffffff81003eef>] do_general_protection+0x134/0x13c
[  452.374382]  [<ffffffff8140fdb0>] ? irq_return+0x0/0x2
[  452.374565]  [<ffffffff8140ff8f>] general_protection+0x1f/0x30
[  452.374754]  [<ffffffff810c5f00>] ? page_referenced+0xee/0x1dc
[  452.374940]  [<ffffffff810c5e92>] ? page_referenced+0x80/0x1dc
[  452.375147]  [<ffffffff810c5d88>] ? try_to_unmap_anon+0xa2/0xb4
[  452.375335]  [<ffffffff810b06bc>] shrink_page_list+0x154/0x4c7
[  452.375519]  [<ffffffff81067149>] ? print_lock_contention_bug+0x1b/0xe1
[  452.375703]  [<ffffffff810af59c>] ? isolate_pages_global+0xd0/0x1fc
[  452.375888]  [<ffffffff8140f856>] ? _raw_spin_unlock_irq+0x30/0x58
[  452.376080]  [<ffffffff810b0d8a>] shrink_inactive_list+0x35b/0x60c
[  452.376284]  [<ffffffff810b1347>] shrink_zone+0x30c/0x3d6
[  452.376476]  [<ffffffff810b1f3d>] do_try_to_free_pages+0x191/0x29a
[  452.376664]  [<ffffffff810b20db>] shrink_all_memory+0x95/0xc4
[  452.376852]  [<ffffffff810af4cc>] ? isolate_pages_global+0x0/0x1fc
[  452.377038]  [<ffffffff81079c9c>] ? count_data_pages+0x65/0x79
[  452.377238]  [<ffffffff81079f03>] hibernate_preallocate_memory+0x1aa/0x2cb
[  452.377429]  [<ffffffff8140bbd4>] ? printk+0x41/0x45
[  452.377611]  [<ffffffff8107878f>] hibernation_snapshot+0x36/0x1e1
[  452.377794]  [<ffffffff81078a08>] hibernate+0xce/0x172
[  452.377975]  [<ffffffff81077775>] state_store+0x5c/0xd3
[  452.378170]  [<ffffffff8118f3cf>] kobj_attr_store+0x17/0x19
[  452.378351]  [<ffffffff8112e288>] sysfs_write_file+0x108/0x144
[  452.378533]  [<ffffffff810db4ff>] vfs_write+0xb2/0x153
[  452.378714]  [<ffffffff810663c9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[  452.378898]  [<ffffffff810db663>] sys_write+0x4a/0x71
[  452.379084]  [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b

-- 
Regards/Gruss,
Boris.


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 15:24                                                                                                       ` Linus Torvalds
  2010-04-10 16:38                                                                                                         ` Borislav Petkov
@ 2010-04-10 16:41                                                                                                         ` Linus Torvalds
  2010-04-10 22:49                                                                                                           ` Johannes Weiner
  1 sibling, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-10 16:41 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson



On Sat, 10 Apr 2010, Linus Torvalds wrote:
> 
> But I think the fact that you are apparently not able to get the list 
> corruption is a good sign. Of course, it might just be harder to trigger, 
> and these things could all be a sign of a different bug, but my gut feel 
> is that we did fix something, and you are just damn good at stressing the 
> new code. Kudos.

Btw, I do hate the current 'find_mergeable_anon_vma()' with its duplicated 
checks for prev/next compatibility that I just made even more complex.

So I'm actually inclined to want to write my simple two-liner fix as a 
rather more complex cleanup patch, below.

It adds way more lines than it deletes, but a lot of it is comments (and 
some of it is just because one routine got split up into three), and I 
think it makes the result a lot more readable.

It also splits off the decision of whether we can reuse an anon_vma from 
the decision of whether we can merge the vma's - the two are kind of 
related, but they are not really the same, and they have different issues. 
I think it's good to try to keep separate issues separate.
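
To illustrate the flags part of the anon_vma_compatible() check in the patch 
below, here is a small userspace sketch. The VM_* values mirror the low 
vm_flags bits from include/linux/mm.h; flags_compatible() is a made-up name 
for illustration and is not part of the patch.

```c
#include <stdbool.h>

/* Low vm_flags bits, mirroring include/linux/mm.h. */
#define VM_READ   0x00000001UL
#define VM_WRITE  0x00000002UL
#define VM_EXEC   0x00000004UL
#define VM_SHARED 0x00000008UL

/*
 * Hypothetical helper: true if the two flag words differ only in the
 * bits that mprotect() may change (read/write/exec). The XOR leaves
 * only the differing bits; masking out the mprotect bits must then
 * leave nothing.
 */
static bool flags_compatible(unsigned long a, unsigned long b)
{
	return !((a ^ b) & ~(VM_READ | VM_WRITE | VM_EXEC));
}
```

So a read-only vma and a read-write vma pass this test, while a private and 
a shared one do not.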

This is UNTESTED! It's meant to be an "obvious cleanup" with no real 
semantic difference, but if I did something wrong it won't work. Also note 
the comment about the lack of locking between two adjacent anon_vma's 
taking a page fault at the same time: the ACCESS_ONCE() is unlikely to 
ever matter (anon_vma's are stable once they are set, so it's really just 
that you could first load a NULL, and then if you re-load the value you 
might get a non-NULL thing).
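
As a rough userspace sketch of what the ACCESS_ONCE() buys us: the macro 
below follows the kernel's definition (a volatile cast), while the *_stub 
structs and snapshot_anon_vma() are stand-ins invented for this example.

```c
#include <stddef.h>

/* Same idea as the kernel macro: one load through a volatile cast. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Stand-in types, only here so the example compiles on its own. */
struct anon_vma_stub { int dummy; };
struct vma_stub { struct anon_vma_stub *anon_vma; };

/*
 * Load the pointer exactly once, then test and return that same
 * snapshot. Without the single load, the compiler would be allowed to
 * read vma->anon_vma twice and see NULL first, non-NULL second.
 */
static struct anon_vma_stub *snapshot_anon_vma(struct vma_stub *vma)
{
	struct anon_vma_stub *anon_vma = ACCESS_ONCE(vma->anon_vma);

	if (!anon_vma)
		return NULL;
	return anon_vma;	/* never re-reads vma->anon_vma */
}
```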

Also note that when checking whether the anon_vma is a singleton, we don't 
hold any lock that protects the list we are checking. But 
"list_is_singular()" is safe and won't oops even if the pointers in the 
list are crap, because it only _compares_ the prev/next pointers, it 
doesn't dereference them.
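
For reference, a minimal userspace copy of the helper, assuming the usual 
include/linux/list.h definitions. It only ever reads head->next and 
head->prev and compares the values, so garbage in those fields cannot make 
it fault.

```c
#include <stdbool.h>

struct list_head {
	struct list_head *next, *prev;
};

/* An empty list is one whose head points back at itself. */
static bool list_empty(const struct list_head *head)
{
	return head->next == head;
}

/*
 * Exactly one entry: not empty, and next/prev name the same node.
 * Only the pointer values stored in 'head' are compared; neither
 * pointer is ever followed.
 */
static bool list_is_singular(const struct list_head *head)
{
	return !list_empty(head) && (head->next == head->prev);
}
```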

In short, what I'm saying is that there is a pretty subtle race in the 
very very unlikely case that two anon_vma's get prepared concurrently, but 
from a correctness standpoint it doesn't matter. We might sometimes - once 
in a blue moon - reject an anon_vma that could in theory have been merged, 
but that won't hurt.

Comments? Rik, Johannes?

			Linus

---
 mm/mmap.c |   86 ++++++++++++++++++++++++++++++++++++++++++++-----------------
 1 files changed, 62 insertions(+), 24 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 75557c6..acb023e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -825,6 +825,61 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 }
 
 /*
+ * Rough compatibility check to quickly see if it's even worth looking
+ * at sharing an anon_vma.
+ *
+ * They need to have the same vm_file, and the flags can only differ
+ * in things that mprotect may change.
+ *
+ * NOTE! The fact that we share an anon_vma doesn't _have_ to mean that
+ * we can merge the two vma's. For example, we refuse to merge a vma if
+ * there is a vm_ops->close() function, because that indicates that the
+ * driver is doing some kind of reference counting. But that doesn't
+ * really matter for the anon_vma sharing case.
+ */
+static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *b)
+{
+	return a->vm_end == b->vm_start &&
+		mpol_equal(vma_policy(a), vma_policy(b)) &&
+		a->vm_file == b->vm_file &&
+		!((a->vm_flags ^ b->vm_flags) & ~(VM_READ|VM_WRITE|VM_EXEC)) &&
+		b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT);
+}
+
+/*
+ * Do some basic sanity checking to see if we can re-use the anon_vma
+ * from 'old'. The 'a'/'b' vma's are in VM order - one of them will be
+ * the same as 'old', the other will be the new one that is trying
+ * to share the anon_vma.
+ *
+ * NOTE! This runs with mm_sem held for reading, so it is possible that
+ * the anon_vma of 'old' is concurrently in the process of being set up
+ * by another page fault trying to merge _that_. But that's ok: if it
+ * is being set up, that automatically means that it will be a singleton
+ * acceptable for merging, so we can do all of this optimistically. But
+ * we do that ACCESS_ONCE() to make sure that we never re-load the pointer.
+ *
+ * IOW: that the "list_is_singular()" test on the anon_vma_chain only
+ * matters for the 'stable anon_vma' case (ie the thing we want to avoid
+ * is to return an anon_vma that is "complex" due to having gone through
+ * a fork).
+ *
+ * We also make sure that the two vma's are compatible (adjacent,
+ * and with the same memory policies). That's all stable, even with just
+ * a read lock on the mm_sem.
+ */
+static struct anon_vma *reusable_anon_vma(struct vm_area_struct *old, struct vm_area_struct *a, struct vm_area_struct *b)
+{
+	if (anon_vma_compatible(a, b)) {
+		struct anon_vma *anon_vma = ACCESS_ONCE(old->anon_vma);
+
+		if (anon_vma && list_is_singular(&old->anon_vma_chain))
+			return anon_vma;
+	}
+	return NULL;
+}
+
+/*
  * find_mergeable_anon_vma is used by anon_vma_prepare, to check
  * neighbouring vmas for a suitable anon_vma, before it goes off
  * to allocate a new anon_vma.  It checks because a repetitive
@@ -834,28 +889,16 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
  */
 struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
 {
+	struct anon_vma *anon_vma;
 	struct vm_area_struct *near;
-	unsigned long vm_flags;
 
 	near = vma->vm_next;
 	if (!near)
 		goto try_prev;
 
-	/*
-	 * Since only mprotect tries to remerge vmas, match flags
-	 * which might be mprotected into each other later on.
-	 * Neither mlock nor madvise tries to remerge at present,
-	 * so leave their flags as obstructing a merge.
-	 */
-	vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
-	vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC);
-
-	if (near->anon_vma && vma->vm_end == near->vm_start &&
- 			mpol_equal(vma_policy(vma), vma_policy(near)) &&
-			can_vma_merge_before(near, vm_flags,
-				NULL, vma->vm_file, vma->vm_pgoff +
-				((vma->vm_end - vma->vm_start) >> PAGE_SHIFT)))
-		return near->anon_vma;
+	anon_vma = reusable_anon_vma(near, vma, near);
+	if (anon_vma)
+		return anon_vma;
 try_prev:
 	/*
 	 * It is potentially slow to have to call find_vma_prev here.
@@ -868,14 +911,9 @@ try_prev:
 	if (!near)
 		goto none;
 
-	vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
-	vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC);
-
-	if (near->anon_vma && near->vm_end == vma->vm_start &&
-  			mpol_equal(vma_policy(near), vma_policy(vma)) &&
-			can_vma_merge_after(near, vm_flags,
-				NULL, vma->vm_file, vma->vm_pgoff))
-		return near->anon_vma;
+	anon_vma = reusable_anon_vma(near, near, vma);
+	if (anon_vma)
+		return anon_vma;
 none:
 	/*
 	 * There's no absolute need to look only at touching neighbours:


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 16:38                                                                                                         ` Borislav Petkov
@ 2010-04-10 17:05                                                                                                           ` Linus Torvalds
  2010-04-10 18:21                                                                                                             ` Linus Torvalds
  2010-04-10 17:07                                                                                                           ` Borislav Petkov
  1 sibling, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-10 17:05 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson


On Sat, 10 Apr 2010, Borislav Petkov wrote:
> 
> And I got an oops again, this time the #GP from couple of days ago.

Oh damn. So the list corruption really does happen still.

And the pattern is similar, but not the same: now it's 0032323200323232, 
rather than 002e2e2e002e2e2e. Very intriguing. 0x32 instead of 0x2e, but 
the same pattern of duplicated bytes. And not very helpful in that it 
still doesn't actually make any sense.

> <thinking out loud>
> 
> I'm starting to think that maybe there could be something wrong with the
> machine I'm running it on. Especially since there are only two people
> who reported this issue, Steinar and me, so how probable is it that
> maybe those two machines have failing RAM module somewhere? Or some
> other data corrupting thing? Although I should be getting mchecks...
> Hmm...

No. Just the fact that there are two people who reported the same 
thing is already a pretty strong sign that it's real. Also, hardware 
problems don't tend to be as consistent in the details as yours have 
been. 

And in fact I have seen it personally (but couldn't reproduce it) on the 
kids' Mac mini after you reported it.

So I'm convinced the problem is real, and just not so easily 
triggered, and you're being a great tester.

			Linus
--
Here's the one I've seen, in case you care.  I haven't posted it, because 
it doesn't really add anything new.

 BUG: unable to handle kernel NULL pointer dereference at (null)
 IP: [<c02850cf>] page_referenced+0xd6/0x199
 *pde = 21d73067 *pte = 00000000 
 Oops: 0000 [#2] SMP 
 last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/host2/target2:0:0/2:0:0:0/block/sda/uevent
 Modules linked in: [last unloaded: scsi_wait_scan]
 
 Pid: 14440, comm: firefox Tainted: G      D    2.6.34-rc2-00391-gfc1203c #3 Mac-F4208EC8/Macmini1,1
 EIP: 0060:[<c02850cf>] EFLAGS: 00210287 CPU: 1
 EIP is at page_referenced+0xd6/0x199
 EAX: f59e65d4 EBX: c10b5480 ECX: 00000000 EDX: fffffff0
 ESI: f59e65d0 EDI: 00000000 EBP: d8f77cd8 ESP: d8f77ca0
  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
 Process firefox (pid: 14440, ti=d8f76000 task=cb795440 task.ti=d8f76000)
 Stack:
  f59e65d4 00000000 fffffff0 c15ba000 d8f77cbc c02885b8 c07972c4 d8f77cdc
  c0276712 00000000 00000001 c10b5498 c10b5480 d8f77e94 d8f77d58 c0276b53
  d8f77d48 00000000 00000000 00000000 0000001d d8f77de8 00000001 c07972c4
 Call Trace:
  [<c02885b8>] ? swapcache_free+0x1b/0x24
  [<c0276712>] ? __remove_mapping+0x90/0xb2
  [<c0276b53>] ? shrink_page_list+0x109/0x3ba
  [<c0277099>] ? shrink_inactive_list+0x295/0x48e
  [<c0273d68>] ? determine_dirtyable_memory+0x34/0x4b
  [<c0273dd0>] ? get_dirty_limits+0x16/0x26d
  [<c027750c>] ? shrink_zone+0x27a/0x327
  [<c03c55a5>] ? i915_gem_shrink+0x67/0x22c
  [<c0277e6d>] ? do_try_to_free_pages+0x17d/0x292
  [<c0278078>] ? try_to_free_pages+0x6a/0x72
  [<c0275cd7>] ? isolate_pages_global+0x0/0x1bd
  [<c0273210>] ? __alloc_pages_nodemask+0x2c2/0x447
  [<c027f1c1>] ? handle_mm_fault+0x188/0x605
  [<c02192c3>] ? do_page_fault+0x253/0x269
  [<c0219070>] ? do_page_fault+0x0/0x269
  [<c05b9e82>] ? error_code+0x66/0x6c
  [<c05b0000>] ? azx_probe+0x5e8/0x8ae
  [<c0219070>] ? do_page_fault+0x0/0x269
 Code: f9 f2 74 18 ff 75 08 8d 45 f0 50 89 d8 e8 62 f6 ff ff 01 c7 59 83 7d f0 00 58 74 20 8b 55 d0 8b 42 10 83 e8 10 89 45 d0 8b 55 d0 <8b> 42 10 0f 18 00 90 89 d0 83 c0 10 39 45 c8 75 ab fe 06 e9 90 
 EIP: [<c02850cf>] page_referenced+0xd6/0x199 SS:ESP 0068:d8f77ca0
 CR2: 0000000000000000
 ---[ end trace 890710798f4c0070 ]---



* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 16:38                                                                                                         ` Borislav Petkov
  2010-04-10 17:05                                                                                                           ` Linus Torvalds
@ 2010-04-10 17:07                                                                                                           ` Borislav Petkov
  1 sibling, 0 replies; 242+ messages in thread
From: Borislav Petkov @ 2010-04-10 17:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

From: Borislav Petkov <bp@alien8.de>
Date: Sat, Apr 10, 2010 at 06:38:28PM +0200

> I'm going to run the stress test on 2.6.33.2 to verify whether this is
> actually software-related. Just in case.

Just did a bunch of hibernation runs - 2.6.33.2 feels rock solid - no
issues whatsoever. So in the face of such results a hw failure is kinda
improbable... Hmm...

-- 
Regards/Gruss,
Boris.


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 17:05                                                                                                           ` Linus Torvalds
@ 2010-04-10 18:21                                                                                                             ` Linus Torvalds
  2010-04-10 18:26                                                                                                               ` Linus Torvalds
                                                                                                                                 ` (3 more replies)
  0 siblings, 4 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-10 18:21 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson



On Sat, 10 Apr 2010, Linus Torvalds wrote:
> On Sat, 10 Apr 2010, Borislav Petkov wrote:
> > 
> > And I got an oops again, this time the #GP from couple of days ago.
> 
> Oh damn. So the list corruption really does happen still.

Ho humm.

Maybe I'm crazy, but something started bothering me. And I started 
wondering: when is the 'page->mapping' of an anonymous page actually 
cleared?

The thing is, the mapping of an anonymous page is actually cleared only 
when the page is _freed_, in "free_hot_cold_page()". 

Now, let's think about that. And in particular, let's think about how that 
relates to the freeing of the 'anon_vma' that the page->mapping points to. 

The way the anon_vma is freed is when the mapping is torn down, and we do 
roughly:

	tlb = tlb_gather_mmu(mm,..)
	..
	unmap_vmas(&tlb, vma ..
	..
	free_pgtables()
	..
	tlb_finish_mmu(tlb, start, end);

and we actually unmap all the pages in "unmap_vmas()", and then _after_ 
unmapping all the pages we do the "unlink_anon_vmas(vma);" in 
"free_pgtables()". Fine so far - the anon_vma stays around until after the 
page has been happily unmapped.

But "unmapped all the pages" is _not_ actually the same as "free'd all the 
pages". The actual _freeing_ of the page happens generally in 
tlb_finish_mmu(), because we can free the page only after we've flushed 
any TLB entries.

So what we have in that tlb_gather structure is a list of _pending_ pages 
to be freed, while we already actually free'd the anon_vmas earlier!
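
A toy userspace model of that window, purely to illustrate the theory. All 
the toy_* names are invented for this sketch, and it deliberately ignores 
RCU and the real allocator; it only tracks the order of the three steps.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_anon_vma { bool freed; };
struct toy_page { struct toy_anon_vma *mapping; bool freed; };

/* Step 1, unmap_vmas(): the page is only queued; ->mapping stays set. */
static void toy_unmap(struct toy_page *page)
{
	(void)page;
}

/* Step 2, free_pgtables() -> unlink_anon_vmas(): anon_vma is released. */
static void toy_unlink_anon_vma(struct toy_anon_vma *av)
{
	av->freed = true;
}

/* Step 3, tlb_finish_mmu(): the page is freed and ->mapping cleared. */
static void toy_free_page(struct toy_page *page)
{
	page->mapping = NULL;
	page->freed = true;
}

/* True while a still-live page points at an already-freed anon_vma. */
static bool toy_mapping_is_stale(const struct toy_page *page)
{
	return !page->freed && page->mapping && page->mapping->freed;
}
```

In this model toy_mapping_is_stale() is true only between step 2 and step 3, 
which is the window being described here.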

Now, the thing is, tlb_gather_mmu() begins a preempt-safe region (because 
we use a per-cpu variable), but as far as I can tell it is _not_ an 
RCU-safe region.

So I think we might actually get a real RCU freeing event while this all
happens. So now the 'anon_vma' that 'page->mapping' points to has not just 
been released back to the SLUB caches, the page itself might have been 
released too.

I dunno. Does the above sound at all sane? Or am I just raving?

Something hacky like the patch below might fix it if I'm not just raving. I 
really might be missing something here.

		Linus

---
 include/asm-generic/tlb.h |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index e43f976..2678118 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -14,6 +14,7 @@
 #define _ASM_GENERIC__TLB_H
 
 #include <linux/swap.h>
+#include <linux/rcupdate.h>
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
 
@@ -62,6 +63,7 @@ tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
 
 	tlb->fullmm = full_mm_flush;
 
+	rcu_read_lock();
 	return tlb;
 }
 
@@ -90,6 +92,7 @@ tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
 
+	rcu_read_unlock();
 	put_cpu_var(mmu_gathers);
 }
 


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 18:21                                                                                                             ` Linus Torvalds
@ 2010-04-10 18:26                                                                                                               ` Linus Torvalds
  2010-04-10 18:51                                                                                                               ` Borislav Petkov
                                                                                                                                 ` (2 subsequent siblings)
  3 siblings, 0 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-10 18:26 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson



On Sat, 10 Apr 2010, Linus Torvalds wrote:
> 
> I dunno. Does the above sound at all sane? Or am I just raving?
> 
> Something hacky like the above might fix it if I'm not just raving. I 
> really might be missing something here.

Btw, if this turns out to be accurate, the real fix is to probably just 
have a separate phase at the very end to actually release all the vma's, 
rather than do it in "free_pgtables()". We don't want to make the 
tlb-gather any more atomic than it already is. In fact, Nick is trying to 
make it preemptible.

So the patch included in that mail was meant very much as a "let's test my 
crazy theory" patch, rather than as the real solution.

The patch is also untested. Maybe it doesn't work at all and introduces 
new bugs. Caveat emptor.

			Linus


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 18:21                                                                                                             ` Linus Torvalds
  2010-04-10 18:26                                                                                                               ` Linus Torvalds
@ 2010-04-10 18:51                                                                                                               ` Borislav Petkov
  2010-04-10 18:58                                                                                                                 ` Borislav Petkov
  2010-04-10 19:36                                                                                                               ` Rik van Riel
  2010-04-12 14:40                                                                                                               ` Peter Zijlstra
  3 siblings, 1 reply; 242+ messages in thread
From: Borislav Petkov @ 2010-04-10 18:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sat, Apr 10, 2010 at 11:21:39AM -0700

> On Sat, 10 Apr 2010, Linus Torvalds wrote:
> > On Sat, 10 Apr 2010, Borislav Petkov wrote:
> > > 
> > > And I got an oops again, this time the #GP from couple of days ago.
> > 
> > Oh damn. So the list corruption really does happen still.
> 
> Ho humm.
> 
> Maybe I'm crazy, but something started bothering me. And I started 
> wondering: when is the 'page->mapping' of an anonymous page actually 
> cleared?
> 
> The thing is, the mapping of an anonymous page is actually cleared only 
> when the page is _freed_, in "free_hot_cold_page()". 
> 
> Now, let's think about that. And in particular, let's think about how that 
> relates to the freeing of the 'anon_vma' that the page->mapping points to. 
> 
> The way the anon_vma is freed is when the mapping is torn down, and we do 
> roughly:
> 
> 	tlb = tlb_gather_mmu(mm,..)
> 	..
> 	unmap_vmas(&tlb, vma ..
> 	..
> 	free_pgtables()
> 	..
> 	tlb_finish_mmu(tlb, start, end);
> 
> and we actually unmap all the pages in "unmap_vmas()", and then _after_ 
> unmapping all the pages we do the "unlink_anon_vmas(vma);" in 
> "free_pgtables()". Fine so far - the anon_vma stays around until after the 
> page has been happily unmapped.
> 
> But "unmapped all the pages" is _not_ actually the same as "free'd all the 
> pages". The actual _freeing_ of the page happens generally in 
> tlb_finish_mmu(), because we can free the page only after we've flushed 
> any TLB entries.
> 
> So what we have in that tlb_gather structure is a list of _pending_ pages 
> to be freed, while we already actually free'd the anon_vmas earlier!
> 
> Now, the thing is, tlb_gather_mmu() begins a preempt-safe region (because 
> we use a per-cpu variable), but as far as I can tell it is _not_ an 
> RCU-safe region.
> 
> So I think we might actually get a real RCU freeing event while this all
> happens. So now the 'anon_vma' that 'page->mapping' points to has not just 
> been released back to the SLUB caches, the page itself might have been 
> released too.

So, if I understand you correctly, the list_head anon_vma gets freed
_before_ the page descriptor itself, therefore we still get a valid
page->mapping in page_lock_anon_vma(). Maybe that explains the funny
patterns in %r13. But how do they come to exist when the anon_vma is
freed, shouldn't there be LIST_POISON or something recognizable?

Anyways, testing...

-- 
Regards/Gruss,
Boris.


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 18:51                                                                                                               ` Borislav Petkov
@ 2010-04-10 18:58                                                                                                                 ` Borislav Petkov
  2010-04-10 20:05                                                                                                                   ` Linus Torvalds
  0 siblings, 1 reply; 242+ messages in thread
From: Borislav Petkov @ 2010-04-10 18:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

From: Borislav Petkov <bp@alien8.de>
Date: Sat, Apr 10, 2010 at 08:51:45PM +0200

> Anyways, testing...

Nope, still b0rked. And this time it's not a funny pattern but the
ffffffffffffffe0 we had originally.
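
If I read the Code: bytes in the oops below right, the two instructions
before the fault are "mov 0x20(%r13),%r13; sub $0x20,%r13" and the faulting
one is "mov 0x20(%r13),%rax". That looks like a list walk which loaded a
NULL ->next and turned it into an entry pointer. A small sketch of the
arithmetic, where the 0x20 offset is taken from the disassembly and the
function names are made up for illustration:

```c
#include <stdint.h>

/*
 * Offset of the list_head inside the entry, as seen in the
 * disassembly (assumption: which struct member this is isn't shown).
 */
#define LIST_MEMBER_OFFSET 0x20

/* What list_entry()/container_of() computes from a list pointer value. */
static uint64_t entry_from_list_ptr(uint64_t list_ptr)
{
	return list_ptr - LIST_MEMBER_OFFSET;
}

/* Address touched when the loop then reads the entry's ->next field. */
static uint64_t next_load_address(uint64_t entry)
{
	return entry + LIST_MEMBER_OFFSET;
}
```

A NULL list pointer gives an entry of ffffffffffffffe0, matching %r13, and
reading ->next from that entry hits address 0, matching CR2.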

[  521.306972] BUG: unable to handle kernel NULL pointer dereference at (null)
[  521.307126] IP: [<ffffffff810c60b4>] page_referenced+0xee/0x1dc
[  521.307126] PGD 22d952067 PUD 2291db067 PMD 0 
[  521.307126] Oops: 0000 [#1] PREEMPT SMP 
[  521.307126] last sysfs file: /sys/power/state
[  521.307126] CPU 1 
[  521.307126] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 pcspkr serial_core ohci_hcd edac_core k10temp
[  521.307126] 
[  521.307126] Pid: 2896, comm: hib.sh Not tainted 2.6.34-rc3-00501-gefb57c0-dirty #5 M3A78 PRO/System Product Name
[  521.307126] RIP: 0010:[<ffffffff810c60b4>]  [<ffffffff810c60b4>] page_referenced+0xee/0x1dc
[  521.307126] RSP: 0018:ffff88022bd9f8b8  EFLAGS: 00010283
[  521.307126] RAX: ffff88022af8c338 RBX: ffffea00067e2998 RCX: 0000000000000000
[  521.307126] RDX: ffff88022bd9fcf8 RSI: ffff88022af8c310 RDI: ffff88022c0c5e60
[  521.307126] RBP: ffff88022bd9f938 R08: 0000000000000002 R09: 0000000000000000
[  521.307126] R10: ffff88022b4454d8 R11: ffffffff00000012 R12: 0000000000000000
[  521.307126] R13: ffffffffffffffe0 R14: ffff88022af8c2f8 R15: ffff88022bd9fa00
[  521.307126] FS:  00007ff70fb586f0(0000) GS:ffff88000a200000(0000) knlGS:0000000000000000
[  521.307126] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  521.307126] CR2: 0000000000000000 CR3: 000000022e19c000 CR4: 00000000000006e0
[  521.307126] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  521.307126] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  521.307126] Process hib.sh (pid: 2896, threadinfo ffff88022bd9e000, task ffff88022c0c5e60)
[  521.307126] Stack:
[  521.307126]  ffff88022af8c338 00000000810c5dd3 ffff88022bd9f918 ffffffff810c5f3c
[  521.307126] <0> ffff880200000000 ffffffff00000001 ffff88022bd9ffd8 ffffea00067d2cf0
[  521.307126] <0> ffffea00067d2cf0 000000022bd9fcf8 ffffea00067d2cf0 ffffea00067e29c0
[  521.307126] Call Trace:
[  521.307126]  [<ffffffff810c5f3c>] ? try_to_unmap_anon+0xa2/0xb4
[  521.307126]  [<ffffffff810b06bc>] shrink_page_list+0x154/0x4c7
[  521.307126]  [<ffffffff81067149>] ? print_lock_contention_bug+0x1b/0xe1
[  521.307126]  [<ffffffff810af59c>] ? isolate_pages_global+0xd0/0x1fc
[  521.307126]  [<ffffffff8140fa66>] ? _raw_spin_unlock_irq+0x30/0x58
[  521.307126]  [<ffffffff810b0d8a>] shrink_inactive_list+0x35b/0x60c
[  521.307126]  [<ffffffff810b1347>] shrink_zone+0x30c/0x3d6
[  521.307126]  [<ffffffff810b1f3d>] do_try_to_free_pages+0x191/0x29a
[  521.307126]  [<ffffffff810b20db>] shrink_all_memory+0x95/0xc4
[  521.307126]  [<ffffffff810af4cc>] ? isolate_pages_global+0x0/0x1fc
[  521.307126]  [<ffffffff81079c9c>] ? count_data_pages+0x65/0x79
[  521.307126]  [<ffffffff81079f03>] hibernate_preallocate_memory+0x1aa/0x2cb
[  521.307126]  [<ffffffff8140bde4>] ? printk+0x41/0x45
[  521.307126]  [<ffffffff8107878f>] hibernation_snapshot+0x36/0x1e1
[  521.307126]  [<ffffffff81078a08>] hibernate+0xce/0x172
[  521.307126]  [<ffffffff81077775>] state_store+0x5c/0xd3
[  521.307126]  [<ffffffff8118f5eb>] kobj_attr_store+0x17/0x19
[  521.307126]  [<ffffffff8112e4a4>] sysfs_write_file+0x108/0x144
[  521.307126]  [<ffffffff810db6b3>] vfs_write+0xb2/0x153
[  521.307126]  [<ffffffff810663c9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[  521.307126]  [<ffffffff810db817>] sys_write+0x4a/0x71
[  521.307126]  [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b
[  521.307126] Code: 3b 56 10 73 1e 48 83 fa f2 74 18 48 8d 4d cc 4d 89 f8 48 89 df e8 11 f2 ff ff 41 01 c4 83 7d cc 00 74 19 4d 8b 6d 20 49 83 ed 20 <49> 8b 45 20 0f 18 08 49 8d 45 20 48 39 45 80 75 aa 4c 89 f7 e8 
[  521.307126] RIP  [<ffffffff810c60b4>] page_referenced+0xee/0x1dc
[  521.307126]  RSP <ffff88022bd9f8b8>
[  521.307126] CR2: 0000000000000000
[  521.320888] ---[ end trace 023d26183296e92e ]---
[  521.321033] note: hib.sh[2896] exited with preempt_count 2
[  521.321206] BUG: scheduling while atomic: hib.sh/2896/0x10000003
[  521.321355] INFO: lockdep is turned off.
[  521.321500] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 pcspkr serial_core ohci_hcd edac_core k10temp
[  521.322884] Pid: 2896, comm: hib.sh Tainted: G      D    2.6.34-rc3-00501-gefb57c0-dirty #5
[  521.323139] Call Trace:
[  521.323288]  [<ffffffff810658df>] ? __debug_show_held_locks+0x1b/0x24
[  521.323440]  [<ffffffff8102dfac>] __schedule_bug+0x72/0x77
[  521.323587]  [<ffffffff8140c1f8>] schedule+0xe3/0x7ff
[  521.323735]  [<ffffffff81030ecb>] __cond_resched+0x18/0x24
[  521.323882]  [<ffffffff8140c9e1>] _cond_resched+0x2c/0x37
[  521.324029]  [<ffffffff810bcef1>] unmap_vmas+0x719/0x911
[  521.324207]  [<ffffffff810c1781>] exit_mmap+0x102/0x1e4
[  521.324356]  [<ffffffff810c16e8>] ? exit_mmap+0x69/0x1e4
[  521.324503]  [<ffffffff810368bc>] mmput+0x48/0xb9
[  521.324651]  [<ffffffff8103ad90>] exit_mm+0x110/0x11d
[  521.324798]  [<ffffffff8103c9e6>] do_exit+0x1c5/0x6e5
[  521.324945]  [<ffffffff81039e2f>] ? kmsg_dump+0x13b/0x155
[  521.325093]  [<ffffffff8100616b>] ? oops_end+0x47/0x93
[  521.325244]  [<ffffffff810061b2>] oops_end+0x8e/0x93
[  521.325396]  [<ffffffff8101f3e5>] no_context+0x1fc/0x20b
[  521.325544]  [<ffffffff8101f580>] __bad_area_nosemaphore+0x18c/0x1af
[  521.325691]  [<ffffffff8101f7bb>] ? do_page_fault+0xa8/0x32d
[  521.325839]  [<ffffffff8101f5b6>] bad_area_nosemaphore+0x13/0x15
[  521.325987]  [<ffffffff8101f886>] do_page_fault+0x173/0x32d
[  521.326138]  [<ffffffff81082b84>] ? __call_rcu+0x11d/0x130
[  521.326289]  [<ffffffff814103e3>] ? error_sti+0x5/0x6
[  521.326437]  [<ffffffff81065387>] ? trace_hardirqs_off_caller+0x1f/0xa9
[  521.326586]  [<ffffffff8140ed0b>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[  521.326737]  [<ffffffff814101ff>] page_fault+0x1f/0x30
[  521.326885]  [<ffffffff810c60b4>] ? page_referenced+0xee/0x1dc
[  521.327034]  [<ffffffff810c6046>] ? page_referenced+0x80/0x1dc
[  521.327185]  [<ffffffff810c5f3c>] ? try_to_unmap_anon+0xa2/0xb4
[  521.327336]  [<ffffffff810b06bc>] shrink_page_list+0x154/0x4c7
[  521.327483]  [<ffffffff81067149>] ? print_lock_contention_bug+0x1b/0xe1
[  521.327632]  [<ffffffff810af59c>] ? isolate_pages_global+0xd0/0x1fc
[  521.327780]  [<ffffffff8140fa66>] ? _raw_spin_unlock_irq+0x30/0x58
[  521.327928]  [<ffffffff810b0d8a>] shrink_inactive_list+0x35b/0x60c
[  521.328079]  [<ffffffff810b1347>] shrink_zone+0x30c/0x3d6
[  521.328232]  [<ffffffff810b1f3d>] do_try_to_free_pages+0x191/0x29a
[  521.328387]  [<ffffffff810b20db>] shrink_all_memory+0x95/0xc4
[  521.328535]  [<ffffffff810af4cc>] ? isolate_pages_global+0x0/0x1fc
[  521.328683]  [<ffffffff81079c9c>] ? count_data_pages+0x65/0x79
[  521.328831]  [<ffffffff81079f03>] hibernate_preallocate_memory+0x1aa/0x2cb
[  521.328979]  [<ffffffff8140bde4>] ? printk+0x41/0x45
[  521.329130]  [<ffffffff8107878f>] hibernation_snapshot+0x36/0x1e1
[  521.329283]  [<ffffffff81078a08>] hibernate+0xce/0x172
[  521.329432]  [<ffffffff81077775>] state_store+0x5c/0xd3
[  521.329580]  [<ffffffff8118f5eb>] kobj_attr_store+0x17/0x19
[  521.329727]  [<ffffffff8112e4a4>] sysfs_write_file+0x108/0x144
[  521.329875]  [<ffffffff810db6b3>] vfs_write+0xb2/0x153
[  521.330022]  [<ffffffff810663c9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[  521.330174]  [<ffffffff810db817>] sys_write+0x4a/0x71
[  521.330326]  [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b

-- 
Regards/Gruss,
Boris.

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 18:21                                                                                                             ` Linus Torvalds
  2010-04-10 18:26                                                                                                               ` Linus Torvalds
  2010-04-10 18:51                                                                                                               ` Borislav Petkov
@ 2010-04-10 19:36                                                                                                               ` Rik van Riel
  2010-04-12 14:40                                                                                                               ` Peter Zijlstra
  3 siblings, 0 replies; 242+ messages in thread
From: Rik van Riel @ 2010-04-10 19:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On 04/10/2010 02:21 PM, Linus Torvalds wrote:

> Maybe I'm crazy, but something started bothering me. And I started
> wondering: when is the 'page->mapping' of an anonymous page actually
> cleared?
>
> The thing is, the mapping of an anonymous page is actually cleared only
> when the page is _freed_, in "free_hot_cold_page()".

Which is also where they are removed from the LRU.
The plot thickens...

> Now, let's think about that. And in particular, let's think about how that
> relates to the freeing of the 'anon_vma' that the page->mapping points to.
>
> The way the anon_vma is freed is when the mapping is torn down, and we do
> roughly:
>
> 	tlb = tlb_gather_mmu(mm,..)
> 	..
> 	unmap_vmas(&tlb, vma ..
> 	..
> 	free_pgtables()
> 	..
> 	tlb_finish_mmu(tlb, start, end);

Looks like we should move the anon_vma freeing from free_pgtables
over to remove_vma?

This code is just below the tlb_finish_mmu in exit_mmap:

         /*
          * Walk the list again, actually closing and freeing it,
          * with preemption enabled, without holding any MM locks.
          */
         while (vma)
                 vma = remove_vma(vma);

This comment in free_pgtables is a little suspect:

                 /*
                  * Hide vma from rmap and truncate_pagecache before freeing
                  * pgtables
                  */
                 unlink_anon_vmas(vma);
                 unlink_file_vma(vma);

After all, the rmap code will quickly notice that there either are
no page tables, or the page tables no longer have anything in them.

It looks like we may have had this use-after-free bug in the VM for
quite a while...  I am not entirely sure what exposed the bug, but
I can see how it works.
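
A toy sketch of the ordering problem described above, under loud
assumptions: it models only the four teardown steps from the quoted
sequence, it assumes the page (and with it page->mapping) goes away at
the tlb_finish_mmu step, and it ignores locking and RCU entirely. The
enum and function names are made up for illustration and are not
kernel APIs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy teardown steps, in the order the quoted sequence runs them. */
enum step { UNMAP_VMAS, FREE_PGTABLES, TLB_FINISH_MMU, REMOVE_VMA, NR_STEPS };

/*
 * Returns true if, after some step, the anon_vma is already freed while
 * the page whose ->mapping still points at it has not been freed yet.
 * unlink_step is the step assumed to drop the last anon_vma reference.
 */
static bool has_stale_mapping_window(enum step unlink_step)
{
	bool anon_vma_freed = false;
	bool page_freed = false;

	for (int s = UNMAP_VMAS; s < NR_STEPS; s++) {
		if (s == (int)unlink_step)
			anon_vma_freed = true;
		if (s == TLB_FINISH_MMU)
			page_freed = true;	/* mapping cleared here in this model */
		if (anon_vma_freed && !page_freed)
			return true;	/* a reclaim scan could see a stale mapping */
	}
	return false;
}
```

In this model, unlinking at FREE_PGTABLES leaves a window and unlinking
at REMOVE_VMA does not, which is the motivation for the suggestion to
move the anon_vma freeing.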


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 18:58                                                                                                                 ` Borislav Petkov
@ 2010-04-10 20:05                                                                                                                   ` Linus Torvalds
  2010-04-10 20:12                                                                                                                     ` Linus Torvalds
                                                                                                                                       ` (2 more replies)
  0 siblings, 3 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-10 20:05 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson



On Sat, 10 Apr 2010, Borislav Petkov wrote:
> From: Borislav Petkov <bp@alien8.de>
> Date: Sat, Apr 10, 2010 at 08:51:45PM +0200
> 
> > Anyways, testing...
> 
> Nope, still b0rked. And this time it is not a funny pattern but the
> ffffffffffffffe0 we had originally.

Ok, I think that just depends on who happens to re-use the allocation and 
how it does it.
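
For reference, a value like ffffffffffffffe0 is what a
list_for_each_entry()-style walk produces when it reads a NULL next
pointer: the entry is the list pointer minus the offset of the list
member. A minimal sketch, assuming a 64-bit layout with the list member
at offset 0x20. The mirror structs below are illustrative stand-ins,
not the kernel definitions, and the 0x20 offset is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative 64-bit mirror of a list head; pointers are modeled as
 * uint64_t so the layout does not depend on the build machine. */
struct list_head_model {
	uint64_t next;
	uint64_t prev;
};

/* Hypothetical chain entry with the walked list member at offset 0x20. */
struct chain_model {
	uint64_t vma;				/* offset 0x00 */
	uint64_t anon_vma;			/* offset 0x08 */
	struct list_head_model same_vma;	/* offset 0x10 */
	struct list_head_model same_anon_vma;	/* offset 0x20 */
};

/* What container_of() computes: member address minus member offset. */
static uint64_t entry_from_list_ptr(uint64_t list_ptr)
{
	return list_ptr - offsetof(struct chain_model, same_anon_vma);
}

/* Address the walk loads from next: &entry->same_anon_vma.next */
static uint64_t next_load_address(uint64_t entry)
{
	return entry + offsetof(struct chain_model, same_anon_vma)
		     + offsetof(struct list_head_model, next);
}
```

With a NULL list pointer this yields an entry of 0xffffffffffffffe0 and
a following load from address 0, which is consistent with a register
holding ffffffffffffffe0 next to a fault at (null).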

I'm pretty sure it's a use-after-free issue, where we have free'd an 
anon_vma too early, even though it has pages associated with it.

If it wasn't the RCU case, it's just something else.

I think it's worth looking at "vma_adjust()", because as I already 
mentioned to Rik earlier - the code is very hard to understand, and it's 
accrued crud over many many years.

And vma_adjust is the one place that does that anon_vma_merge(), which is 
apart from the actual unmapping sequence the only other place that 
actually free's anon_vmas. So there are reasons to be very suspicious of 
that code.

And I think that code can actually lose an anon_vma chain. It's totally 
screwing up the "import anonvma" case: when it does

                        if (anon_vma_clone(importer, vma)) {
                                return -ENOMEM;
                        }
                        importer->anon_vma = anon_vma;

we can actually have "importer == vma", but "anon_vma = next->anon_vma". 

In which case we actually end up with an _empty_ chain (because importer 
didn't have a chain to begin with!) but "importer->anon_vma" points to an 
anon_vma.

And then when we do that "remove_next", we actually get rid of the only 
chain we ever had, and have lost all our references to the anon_vma.

That looks _horribly_ buggy.

Also, the conditional nesting makes no sense (the whole anon_vma_clone() 
only makes sense if importer is set, and it is only ever set _inside_ the 
earlier if-statement, so the whole code should be moved inside there), nor 
do some of the comments.

This patch is scary and untested, but the more I look at that code, the 
more convinced I am that vma_adjust was _really_ badly screwed up. The 
patch below may make things worse. I'll test it myself too, but I'm 
sending it out first, since I was writing the email as I was looking at 
the piece of cr*p.

		Linus

---
 mm/mmap.c |   24 ++++++++----------------
 1 files changed, 8 insertions(+), 16 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index acb023e..f90ea92 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -507,11 +507,12 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	struct address_space *mapping = NULL;
 	struct prio_tree_root *root = NULL;
 	struct file *file = vma->vm_file;
-	struct anon_vma *anon_vma = NULL;
 	long adjust_next = 0;
 	int remove_next = 0;
 
 	if (next && !insert) {
+		struct vm_area_struct *exporter = NULL;
+
 		if (end >= next->vm_end) {
 			/*
 			 * vma expands, overlapping all the next, and
@@ -519,7 +520,7 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
 			 */
 again:			remove_next = 1 + (end > next->vm_end);
 			end = next->vm_end;
-			anon_vma = next->anon_vma;
+			exporter = next;
 			importer = vma;
 		} else if (end > next->vm_start) {
 			/*
@@ -527,7 +528,7 @@ again:			remove_next = 1 + (end > next->vm_end);
 			 * mprotect case 5 shifting the boundary up.
 			 */
 			adjust_next = (end - next->vm_start) >> PAGE_SHIFT;
-			anon_vma = next->anon_vma;
+			exporter = next;
 			importer = vma;
 		} else if (end < vma->vm_end) {
 			/*
@@ -536,28 +537,19 @@ again:			remove_next = 1 + (end > next->vm_end);
 			 * mprotect case 4 shifting the boundary down.
 			 */
 			adjust_next = - ((vma->vm_end - end) >> PAGE_SHIFT);
-			anon_vma = next->anon_vma;
+			exporter = vma;
 			importer = next;
 		}
-	}
 
-	/*
-	 * When changing only vma->vm_end, we don't really need anon_vma lock.
-	 */
-	if (vma->anon_vma && (insert || importer || start != vma->vm_start))
-		anon_vma = vma->anon_vma;
-	if (anon_vma) {
 		/*
 		 * Easily overlooked: when mprotect shifts the boundary,
 		 * make sure the expanding vma has anon_vma set if the
 		 * shrinking vma had, to cover any anon pages imported.
 		 */
-		if (importer && !importer->anon_vma) {
-			/* Block reverse map lookups until things are set up. */
-			if (anon_vma_clone(importer, vma)) {
+		if (exporter && exporter->anon_vma && !importer->anon_vma) {
+			if (anon_vma_clone(importer, exporter))
 				return -ENOMEM;
-			}
-			importer->anon_vma = anon_vma;
+			importer->anon_vma = exporter->anon_vma;
 		}
 	}
 

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 20:05                                                                                                                   ` Linus Torvalds
@ 2010-04-10 20:12                                                                                                                     ` Linus Torvalds
  2010-04-10 20:36                                                                                                                       ` Borislav Petkov
  2010-04-10 20:24                                                                                                                     ` Rik van Riel
  2010-04-10 20:32                                                                                                                     ` Rik van Riel
  2 siblings, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-10 20:12 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson



On Sat, 10 Apr 2010, Linus Torvalds wrote:
> 
> This patch is scary and untested, but the more I look at that code, the 
> more convinced I am that vma_adjust was _really_ badly screwed up. The 
> patch below may make things worse. I'll test it myself too, but I'm 
> sending it out first, since I was writing the email as I was looking at 
> the piece of cr*p.

Ok, it boots. Which means it must be bug-free and perfect. And I really am 
convinced that the old vma_adjust() use of anon_vma_clone() was _totally_ 
broken, so this really could explain everything.

The RCU grace period thing for the TLB flush does look like a real bug 
too, but it's one that is probably impossible to hit in practice.

A broken vma_adjust(), however, would seem to be trivial to hit once you 
just get the right memory freeing patterns going, because the anon_vma 
would easily be _loong_ gone because we didn't create a chain to it at 
all, so the anon_vma code decided that it's not used any more.

So I'm actually pretty optimistic that this really is it.

		Linus

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 20:05                                                                                                                   ` Linus Torvalds
  2010-04-10 20:12                                                                                                                     ` Linus Torvalds
@ 2010-04-10 20:24                                                                                                                     ` Rik van Riel
  2010-04-10 20:34                                                                                                                       ` Linus Torvalds
  2010-04-10 20:32                                                                                                                     ` Rik van Riel
  2 siblings, 1 reply; 242+ messages in thread
From: Rik van Riel @ 2010-04-10 20:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On 04/10/2010 04:05 PM, Linus Torvalds wrote:

> And vma_adjust is the one place that does that anon_vma_merge(), which is
> apart from the actual unmapping sequence the only other place that
> actually free's anon_vmas. So there are reasons to be very suspicious of
> that code.

It frees anon_vma_chain structures, but not actual anon_vmas.

Walking the anon_vma (from rmap) requires the anon_vma->lock,
which is taken in anon_vma_merge whenever a chain is unlinked.

> And I think that code can actually lose an anon_vma chain. It's totally
> screwing up the "import anonvma" case: when it does
>
>                          if (anon_vma_clone(importer, vma)) {
>                                  return -ENOMEM;
>                          }
>                          importer->anon_vma = anon_vma;
>
> we can actually have "importer == vma", but "anon_vma = next->anon_vma".

A few lines up from that code, we have:

         if (vma->anon_vma && (insert || importer || start != vma->vm_start))
                 anon_vma = vma->anon_vma;

So anon_vma should always be vma->anon_vma.

If we have already imported an anon_vma, we will not
do so twice, because of the !importer->anon_vma check.

What am I overlooking?

> In which case we actually end up with an _empty_ chain (because importer
> didn't have a chain to begin with!) but "importer->anon_vma" points to an
> anon_vma.

If we import a chain, from vma to importer, importer->anon_vma
will be equal to vma->anon_vma.

I do not see how 'importer' could get a state different from 'vma'.

> Also, the conditional nesting makes no sense (the whole anon_vma_clone()
> only makes sense if importer is set, and it is only ever set _inside_ the
> earlier if-statement, so the whole code should be moved inside there), nor
> do some of the comments.

No argument there, vma_adjust is very hard to read and it took
me a few days to convince myself that my changes kept things
equivalent to how they were before.


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 20:05                                                                                                                   ` Linus Torvalds
  2010-04-10 20:12                                                                                                                     ` Linus Torvalds
  2010-04-10 20:24                                                                                                                     ` Rik van Riel
@ 2010-04-10 20:32                                                                                                                     ` Rik van Riel
  2 siblings, 0 replies; 242+ messages in thread
From: Rik van Riel @ 2010-04-10 20:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On 04/10/2010 04:05 PM, Linus Torvalds wrote:

> This patch is scary and untested, but the more I look at that code, the
> more convinced I am that vma_adjust was _really_ badly screwed up. The
> patch below may make things worse. I'll test it myself too, but I'm
> sending it out first, since I was writing the email as I was looking at
> the piece of cr*p.

Your patch looks correct.  Gotta love how before,
"vma" could be either exporter or importer!

I'm guessing that it did not break before my
changes, because of plain old luck...

Acked-by: Rik van Riel <riel@redhat.com>

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 20:24                                                                                                                     ` Rik van Riel
@ 2010-04-10 20:34                                                                                                                       ` Linus Torvalds
  2010-04-10 20:43                                                                                                                         ` Rik van Riel
  0 siblings, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-10 20:34 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson



On Sat, 10 Apr 2010, Rik van Riel wrote:

> On 04/10/2010 04:05 PM, Linus Torvalds wrote:
> 
> > And vma_adjust is the one place that does that anon_vma_merge(), which is
> > apart from the actual unmapping sequence the only other place that
> > actually free's anon_vmas. So there are reasons to be very suspicious of
> > that code.
> 
> It frees anon_vma_chain structures, but not actual anon_vmas.

Rik, I think you're ignoring the fact that the anon_vma_chain is also the 
implicit refcount.

So when you don't create the chains, you implicitly end up freeing the 
anon_vma too early. In fact, it might well happen at that 
'anon_vma_merge()': when it does the unlink_anon_vmas(), it may be 
unlinking the last remaining anon_vma ref, and then anon_vma_unlink 
_will_ in fact free the anon_vma.

Even though we have a 'vma->anon_vma' pointer that points to it - because 
the chains weren't set up correctly.

> Walking the anon_vma (from rmap) requires the anon_vma->lock,
> which is taken in anon_vma_merge whenever a chain is unlinked.

None of that matters. If the dang thing got free'd, the lock isn't 
reliable any more.

> A few lines up from that code, we have:
> 
>         if (vma->anon_vma && (insert || importer || start != vma->vm_start))
>                 anon_vma = vma->anon_vma;
> 
> So anon_vma should always be vma->anon_vma.

No. vma->anon_vma is NULL, so the above lines are total no-ops. We're 
trying to _fill_ it. But we're doing it wrong.

So we end up with:

	anon_vma = next->anon_vma
	importer = vma

and we do:

	if (anon_vma_clone(importer, vma)) {
		return -ENOMEM;
	}
	importer->anon_vma = anon_vma;

do you see?

The "anon_vma_clone(importer, vma)" does NOTHING, because it is cloning 
from the wrong source (from 'vma', rather than from 'next'), so it leaves 
the vma chains empty.

And then, despite having empty chains, we do that

	importer->anon_vma = anon_vma;

which sets the anon_vma to the (non-NULL) next->anon_vma.

And then, a bit later, we'll do

	anon_vma_merge(vma, next);

which will happily notice that the anon_vma's of both vma and next match 
(because we just _set_ them to match), and then frees the ONLY REMAINING 
CHAIN - the one in next. The one we DID NOT CORRECTLY COPY, because we got 
our sources completely screwed up.
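
The sequence above can be sketched as a toy model. Everything here is
an assumption made for illustration: each vma holds at most one chain
link, the link count stands in for the implicit refcount, and a "freed"
flag replaces a real free so the stale pointer can be inspected safely.
None of the toy_* names exist in the kernel.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy anon_vma: it stays allocated only while at least one chain links to it. */
struct toy_anon_vma {
	int nr_chains;
	bool freed;
};

/* Toy vma: ->anon_vma is a bare pointer, ->chain is the link keeping it alive. */
struct toy_vma {
	struct toy_anon_vma *anon_vma;
	struct toy_anon_vma *chain;
};

/* Copy src's chain link (if any) to dst, taking one more implicit reference. */
static void toy_clone(struct toy_vma *dst, const struct toy_vma *src)
{
	if (src->chain) {
		dst->chain = src->chain;
		dst->chain->nr_chains++;
	}
}

/* Drop vma's chain link; the last link going away frees the anon_vma. */
static void toy_unlink(struct toy_vma *vma)
{
	if (vma->chain && --vma->chain->nr_chains == 0)
		vma->chain->freed = true;
	vma->chain = NULL;
}

/*
 * vma has no anon_vma and expands over next, which has one.
 * Returns true if vma ends up pointing at a freed anon_vma.
 */
static bool expand_leaves_dangling_anon_vma(bool clone_from_next)
{
	struct toy_anon_vma av = { .nr_chains = 1, .freed = false };
	struct toy_vma vma = { NULL, NULL };
	struct toy_vma next = { &av, &av };

	toy_clone(&vma, clone_from_next ? &next : &vma);
	vma.anon_vma = next.anon_vma;	/* importer->anon_vma = anon_vma */
	toy_unlink(&next);		/* the merge then drops next's chain */

	return vma.anon_vma->freed;
}
```

Cloning from 'vma' copies nothing, so dropping next's link frees the
anon_vma that vma now points at. Cloning from 'next' leaves one link
behind and the anon_vma survives.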
	
> What am I overlooking?

Can you see it now?

> If we import a chain, from vma to importer, importer->anon_vma
> will be equal to vma->anon_vma.

The thing you seem to miss is that we aren't supposed to import the chain 
from 'vma' AT ALL. The anon_vma came from _next_, not from 'vma'!

> I do not see how 'importer' could get a state different from 'vma'.

Stop worrying about 'vma'. Start worrying about 'next'.

			Linus

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 20:12                                                                                                                     ` Linus Torvalds
@ 2010-04-10 20:36                                                                                                                       ` Borislav Petkov
  2010-04-10 20:40                                                                                                                         ` Linus Torvalds
  0 siblings, 1 reply; 242+ messages in thread
From: Borislav Petkov @ 2010-04-10 20:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sat, Apr 10, 2010 at 01:12:46PM -0700

> So I'm actually pretty optimistic that this really is it.

Ok, let me verify what should be tested, and in which order, before I
test the wrong thing. The RCU-safe fix for the TLB flush can stay for
correctness reasons, and this last patch too, obviously. What happens
with the find_mergeable_anon_vma() changes to use only singleton lists
for merging? Should I keep those too?

-- 
Regards/Gruss,
Boris.

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 20:36                                                                                                                       ` Borislav Petkov
@ 2010-04-10 20:40                                                                                                                         ` Linus Torvalds
  2010-04-10 21:25                                                                                                                           ` Borislav Petkov
  0 siblings, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-10 20:40 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson



On Sat, 10 Apr 2010, Borislav Petkov wrote:
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Sat, Apr 10, 2010 at 01:12:46PM -0700
> 
> > So I'm actually pretty optimistic that this really is it.
> 
> Ok, let me verify what should be tested, and in which order, before I
> test the wrong thing. The RCU-safe fix for the TLB flush can stay for
> correctness reasons, and this last patch too, obviously. What happens
> with the find_mergeable_anon_vma() changes to use only singleton lists
> for merging? Should I keep those too?

Yes. So the patches I actually think are important are:

 - the RCU fix is real, although admittedly the race window is probably 
   too small to ever really hit.

 - the simplification rule to find_mergeable_anon_vma's is required, 
   because otherwise our anon_vma_merge() will do the wrong thing (maybe 
   Johannes' patch would be an alternative, but quite frankly, I think we 
   want the simpler code, and I don't think we even _want_ to share 
   anon_vma's that are complex due to forking)

   I like my "cleanup" version (the bigger one with lots of comments) more 
   than the two-liner version, but they should be equivalent.

 - the vma_adjust() fix is the one that I think may actually end up fixing 
   your problems for good. Knock wood.

So I think they are all required, but I suspect that the vma_adjust() one 
is finally the most direct explanation of the problem you've seen.

		Linus

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 20:34                                                                                                                       ` Linus Torvalds
@ 2010-04-10 20:43                                                                                                                         ` Rik van Riel
  0 siblings, 0 replies; 242+ messages in thread
From: Rik van Riel @ 2010-04-10 20:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On 04/10/2010 04:34 PM, Linus Torvalds wrote:

>> What am I overlooking?
>
> Can you see it now?

Yeah, after reading through your patch it became obvious.
It's the code above this code that sets up the problem.

It's a small miracle it worked before...

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 20:40                                                                                                                         ` Linus Torvalds
@ 2010-04-10 21:25                                                                                                                           ` Borislav Petkov
  2010-04-10 21:30                                                                                                                             ` Linus Torvalds
  0 siblings, 1 reply; 242+ messages in thread
From: Borislav Petkov @ 2010-04-10 21:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sat, Apr 10, 2010 at 01:40:39PM -0700

> Yes. So the patches I actually think are important are:
> 
>  - the RCU fix is real, although admittedly the race window is probably 
>    too small to ever really hit.
> 
>  - the simplification rule to find_mergeable_anon_vma's is required, 
>    because otherwise our anon_vma_merge() will do the wrong thing (maybe 
>    Johannes' patch would be an alternative, but quite frankly, I think we 
>    want the simpler code, and I don't think we even _want_ to share 
>    anon_vma's that are complex due to forking)
> 
>    I like my "cleanup" version (the bigger one with lots of comments) more 
>    than the two-liner version, but they should be equivalent.
> 
>  - the vma_adjust() fix is the one that I think may actually end up fixing 
>    your problems for good. Knock wood.
> 
> So I think they are all required, but I suspect that the vma_adjust() one 
> is finally the most direct explanation of the problem you've seen.

Damn, nope, still no joy :(. It looked like it was fixed, but one of the
tests was to hibernate right after the 3 kvm guests were shut down, and I
guess the mem freeing pattern kinda hits it where it hurts most.

Anyways, I'm going to bed soon, will test whatever you guys come up with
tomorrow morning when I can think again.

By the way, do we want to create a new thread? The mailchain is off the
screen limits of my netbook :)

Thanks.

p.s. Oopsie:


[  647.288638] PM: Syncing filesystems ... done.
[  647.307459] Freezing user space processes ... (elapsed 0.01 seconds) done.
[  647.320981] Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.
[  647.334152] PM: Preallocating image memory... 
[  647.492781] BUG: unable to handle kernel NULL pointer dereference at (null)
[  647.493001] IP: [<ffffffff810c60a0>] page_referenced+0xee/0x1dc
[  647.493001] PGD 22a1d1067 PUD 1cb6a9067 PMD 0 
[  647.493001] Oops: 0000 [#1] PREEMPT SMP 
[  647.493001] last sysfs file: /sys/power/state
[  647.493001] CPU 0 
[  647.493001] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp ohci_hcd 8250 serial_core pcspkr k10temp edac_core
[  647.493001] 
[  647.493001] Pid: 3231, comm: hib.sh Not tainted 2.6.34-rc3-00503-g8b3334b #6 M3A78 PRO/System Product Name
[  647.493001] RIP: 0010:[<ffffffff810c60a0>]  [<ffffffff810c60a0>] page_referenced+0xee/0x1dc
[  647.493001] RSP: 0018:ffff880223b6f8b8  EFLAGS: 00010283
[  647.493001] RAX: ffff88022aa316c8 RBX: ffffea0006882fc0 RCX: 0000000000000000
[  647.493001] RDX: ffff880223b6fcf8 RSI: ffff88022aa316a0 RDI: ffff88022de6de60
[  647.493001] RBP: ffff880223b6f938 R08: 0000000000000002 R09: 0000000000000000
[  647.493001] R10: ffff880228cb03a8 R11: ffffffff00000012 R12: 0000000000000000
[  647.493001] R13: ffffffffffffffe0 R14: ffff88022aa31688 R15: ffff880223b6fa00
[  647.493001] FS:  00007f0eea2086f0(0000) GS:ffff88000a000000(0000) knlGS:0000000000000000
[  647.493001] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  647.493001] CR2: 0000000000000000 CR3: 0000000223df5000 CR4: 00000000000006f0
[  647.493001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  647.493001] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  647.493001] Process hib.sh (pid: 3231, threadinfo ffff880223b6e000, task ffff88022de6de60)
[  647.493001] Stack:
[  647.493001]  ffff88022aa316c8 00000000810c5dbf ffff880223b6f918 ffffffff810c5f28
[  647.493001] <0> ffff880223b6f8f8 ffffffff00000001 ffffea0006867570 ffffea0006889070
[  647.493001] <0> ffffea0006889070 0000000223b6fcf8 ffffea0006889070 ffffea0006882fe8
[  647.493001] Call Trace:
[  647.493001]  [<ffffffff810c5f28>] ? try_to_unmap_anon+0xa2/0xb4
[  647.493001]  [<ffffffff810b06bc>] shrink_page_list+0x154/0x4c7
[  647.493001]  [<ffffffff810b0d8a>] shrink_inactive_list+0x35b/0x60c
[  647.493001]  [<ffffffff810b1155>] ? shrink_zone+0x11a/0x3d6
[  647.493001]  [<ffffffff81067149>] ? print_lock_contention_bug+0x1b/0xe1
[  647.493001]  [<ffffffff8140f000>] ? _raw_spin_lock_irq+0x19/0x79
[  647.493001]  [<ffffffff810b1347>] shrink_zone+0x30c/0x3d6
[  647.493001]  [<ffffffff810b155b>] ? shrink_slab+0x14a/0x15c
[  647.493001]  [<ffffffff810b1f3d>] do_try_to_free_pages+0x191/0x29a
[  647.493001]  [<ffffffff810b20db>] shrink_all_memory+0x95/0xc4
[  647.493001]  [<ffffffff810af4cc>] ? isolate_pages_global+0x0/0x1fc
[  647.493001]  [<ffffffff81079c9c>] ? count_data_pages+0x65/0x79
[  647.493001]  [<ffffffff81079f03>] hibernate_preallocate_memory+0x1aa/0x2cb
[  647.493001]  [<ffffffff8140bdd4>] ? printk+0x41/0x45
[  647.493001]  [<ffffffff8107878f>] hibernation_snapshot+0x36/0x1e1
[  647.493001]  [<ffffffff81078a08>] hibernate+0xce/0x172
[  647.493001]  [<ffffffff81077775>] state_store+0x5c/0xd3
[  647.493001]  [<ffffffff8118f5d7>] kobj_attr_store+0x17/0x19
[  647.493001]  [<ffffffff8112e490>] sysfs_write_file+0x108/0x144
[  647.493001]  [<ffffffff810db69f>] vfs_write+0xb2/0x153
[  647.493001]  [<ffffffff810663c9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[  647.493001]  [<ffffffff810db803>] sys_write+0x4a/0x71
[  647.493001]  [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b
[  647.493001] Code: 3b 56 10 73 1e 48 83 fa f2 74 18 48 8d 4d cc 4d 89 f8 48 89 df e8 11 f2 ff ff 41 01 c4 83 7d cc 00 74 19 4d 8b 6d 20 49 83 ed 20 <49> 8b 45 20 0f 18 08 49 8d 45 20 48 39 45 80 75 aa 4c 89 f7 e8 
[  647.493001] RIP  [<ffffffff810c60a0>] page_referenced+0xee/0x1dc
[  647.493001]  RSP <ffff880223b6f8b8>
[  647.493001] CR2: 0000000000000000
[  647.508991] ---[ end trace 91f57fb5ef398fd2 ]---
[  647.509150] note: hib.sh[3231] exited with preempt_count 2
[  647.509311] BUG: scheduling while atomic: hib.sh/3231/0x10000003
[  647.509462] INFO: lockdep is turned off.
[  647.509610] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp ohci_hcd 8250 serial_core pcspkr k10temp edac_core
[  647.511093] Pid: 3231, comm: hib.sh Tainted: G      D    2.6.34-rc3-00503-g8b3334b #6
[  647.511353] Call Trace:
[  647.511504]  [<ffffffff810658df>] ? __debug_show_held_locks+0x1b/0x24
[  647.511658]  [<ffffffff8102dfac>] __schedule_bug+0x72/0x77
[  647.511811]  [<ffffffff8140c1e8>] schedule+0xe3/0x7ff
[  647.511962]  [<ffffffff810bd0e4>] ? unmap_vmas+0x90c/0x911
[  647.512191]  [<ffffffff81030ecb>] __cond_resched+0x18/0x24
[  647.512337]  [<ffffffff8140c9d1>] _cond_resched+0x2c/0x37
[  647.512550]  [<ffffffff810bcef1>] unmap_vmas+0x719/0x911
[  647.512697]  [<ffffffff810c1781>] exit_mmap+0x102/0x1e4
[  647.512911]  [<ffffffff810c16e8>] ? exit_mmap+0x69/0x1e4
[  647.513082]  [<ffffffff810368bc>] mmput+0x48/0xb9
[  647.513233]  [<ffffffff8103ad90>] exit_mm+0x110/0x11d
[  647.513387]  [<ffffffff8103c9e6>] do_exit+0x1c5/0x6e5
[  647.513538]  [<ffffffff81039e2f>] ? kmsg_dump+0x13b/0x155
[  647.513690]  [<ffffffff8100616b>] ? oops_end+0x47/0x93
[  647.513859]  [<ffffffff810061b2>] oops_end+0x8e/0x93
[  647.514009]  [<ffffffff8101f3e5>] no_context+0x1fc/0x20b
[  647.514172]  [<ffffffff8118b72b>] ? cfq_insert_request+0x7a/0x3b1
[  647.514321]  [<ffffffff8101f580>] __bad_area_nosemaphore+0x18c/0x1af
[  647.514473]  [<ffffffff8101f7bb>] ? do_page_fault+0xa8/0x32d
[  647.514625]  [<ffffffff8101f5b6>] bad_area_nosemaphore+0x13/0x15
[  647.514777]  [<ffffffff8101f886>] do_page_fault+0x173/0x32d
[  647.514929]  [<ffffffff814103a3>] ? error_sti+0x5/0x6
[  647.515084]  [<ffffffff81065387>] ? trace_hardirqs_off_caller+0x1f/0xa9
[  647.515242]  [<ffffffff8140ecfb>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[  647.515397]  [<ffffffff814101bf>] page_fault+0x1f/0x30
[  647.515549]  [<ffffffff810c60a0>] ? page_referenced+0xee/0x1dc
[  647.515701]  [<ffffffff810c6032>] ? page_referenced+0x80/0x1dc
[  647.515853]  [<ffffffff810c5f28>] ? try_to_unmap_anon+0xa2/0xb4
[  647.516010]  [<ffffffff810b06bc>] shrink_page_list+0x154/0x4c7
[  647.516167]  [<ffffffff810b0d8a>] shrink_inactive_list+0x35b/0x60c
[  647.516323]  [<ffffffff810b1155>] ? shrink_zone+0x11a/0x3d6
[  647.516474]  [<ffffffff81067149>] ? print_lock_contention_bug+0x1b/0xe1
[  647.516627]  [<ffffffff8140f000>] ? _raw_spin_lock_irq+0x19/0x79
[  647.516780]  [<ffffffff810b1347>] shrink_zone+0x30c/0x3d6
[  647.516931]  [<ffffffff810b155b>] ? shrink_slab+0x14a/0x15c
[  647.517086]  [<ffffffff810b1f3d>] do_try_to_free_pages+0x191/0x29a
[  647.517243]  [<ffffffff810b20db>] shrink_all_memory+0x95/0xc4
[  647.517398]  [<ffffffff810af4cc>] ? isolate_pages_global+0x0/0x1fc
[  647.517551]  [<ffffffff81079c9c>] ? count_data_pages+0x65/0x79
[  647.517703]  [<ffffffff81079f03>] hibernate_preallocate_memory+0x1aa/0x2cb
[  647.517856]  [<ffffffff8140bdd4>] ? printk+0x41/0x45
[  647.518011]  [<ffffffff8107878f>] hibernation_snapshot+0x36/0x1e1
[  647.518168]  [<ffffffff81078a08>] hibernate+0xce/0x172
[  647.518322]  [<ffffffff81077775>] state_store+0x5c/0xd3
[  647.518473]  [<ffffffff8118f5d7>] kobj_attr_store+0x17/0x19
[  647.518625]  [<ffffffff8112e490>] sysfs_write_file+0x108/0x144
[  647.518777]  [<ffffffff810db69f>] vfs_write+0xb2/0x153
[  647.518928]  [<ffffffff810663c9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[  647.519084]  [<ffffffff810db803>] sys_write+0x4a/0x71
[  647.519240]  [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b
[  699.648857] SysRq : HELP : loglevel(0-9) reBoot Crash show-all-locks(D) terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) thaw-filesystems(J) saK show-backtrace-all-active-cpus(L) show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z) 
[  700.234923] SysRq : Emergency Sync
[  700.235341] Emergency Sync complete
[  700.982072] SysRq : Emergency Remount R/O
[  701.600802] SysRq : Resetting

-- 
Regards/Gruss,
Boris.

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 21:25                                                                                                                           ` Borislav Petkov
@ 2010-04-10 21:30                                                                                                                             ` Linus Torvalds
  2010-04-10 21:51                                                                                                                               ` Borislav Petkov
  0 siblings, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-10 21:30 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson



On Sat, 10 Apr 2010, Borislav Petkov wrote:
> 
> Damn, nope, still no joy :(. It looked like it was fixed but one of the
> tests was to hibernate right after the 3 kvm guests were shut down and I
> guess the mem freeing pattern kinda hits it where it most hurts.

Damn, I really hoped that was it. Three independent bugs found and fixed, 
and still no joy? Oh well.

> By the way, do we want to create a new thread - the mailchain is off the
> screen limits of my netbook :)

I prefer to keep it in one thread so that they all show up together if I 
need to, but feel free to start a new one. Not a biggie.

> [  647.492781] BUG: unable to handle kernel NULL pointer dereference at (null)
> [  647.493001] IP: [<ffffffff810c60a0>] page_referenced+0xee/0x1dc

Well, it sure is consistent. I'll start to think about what else could go 
wrong..

		Linus

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 21:30                                                                                                                             ` Linus Torvalds
@ 2010-04-10 21:51                                                                                                                               ` Borislav Petkov
  2010-04-11 13:08                                                                                                                                 ` Borislav Petkov
  0 siblings, 1 reply; 242+ messages in thread
From: Borislav Petkov @ 2010-04-10 21:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sat, Apr 10, 2010 at 02:30:49PM -0700

> On Sat, 10 Apr 2010, Borislav Petkov wrote:
> > 
> > Damn, nope, still no joy :(. It looked like it was fixed but one of the
> > tests was to hibernate right after the 3 kvm guests were shut down and I
> > guess the mem freeing pattern kinda hits it where it most hurts.
> 
> Damn, I really hoped that was it. Three independent bugs found and fixed, 
> and still no joy? Oh well.

Yep, I'll redo the testing tomorrow, so that we are sure that even with
the _three_ bugs fixed we still hit the funky list element issue.

> > By the way, do we want to create a new thread - the mailchain is off the
> > screen limits of my netbook :)
> 
> I prefer to keep it in one thread so that they all show up together if I 
> need to, but feel free to start a new one. Not a biggie.

I'll keep the thread then - I didn't know it mattered. Mine was just a
suggestion, never mind.

> > [  647.492781] BUG: unable to handle kernel NULL pointer dereference at (null)
> > [  647.493001] IP: [<ffffffff810c60a0>] page_referenced+0xee/0x1dc
> 
> Well, it sure is consistent. I'll start to think about what else could go 
> wrong..

Which could mean that even with those issues fixed, the real issue is
yet something else. Because obviously the fixes you throw at it don't
seem to change it - even the traces remain consistent across tests.
And if it is a use-after-free case, the funny patterns could be some
shifted SLUB poison values which we happen to "see" through the dangling
pointer...  I dunno.

Hmm.

-- 
Regards/Gruss,
Boris.

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 16:41                                                                                                         ` Linus Torvalds
@ 2010-04-10 22:49                                                                                                           ` Johannes Weiner
  2010-04-10 23:31                                                                                                             ` Linus Torvalds
  0 siblings, 1 reply; 242+ messages in thread
From: Johannes Weiner @ 2010-04-10 22:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On Sat, Apr 10, 2010 at 09:41:52AM -0700, Linus Torvalds wrote:
> [...]
>
> It also splits off the decision of whether we can reuse an anon_vma from 
> the decision of whether we can merge the vma's - the two are kind of 
> related, but they are not really the same, and they have different issues. 
> I think it's good to try to keep separate issues separate.
>
> [...]
>
> + * NOTE! The fact that we share an anon_vma doesn't _have_ to mean that
> + * we can merge the two vma's. For example, we refuse to merge a vma if
> + * there is a vm_ops->close() function, because that indicates that the
> + * driver is doing some kind of reference counting. But that doesn't
> + * really matter for the anon_vma sharing case.

I am all in favor of only doing singletons, so that we don't have to
inflict my psycho-active merging routine on civilians.

I am not convinced it's a good idea to share an anon_vma, however, when
we know beforehand the vmas will never merge, because it will increase
rmap overhead of walking unrelated vmas for every page in every vma that
is part of the reused anon_vma.

So we usually take that as a trade-off when there is a chance the vmas
could still reunite and we don't want to spoil that through differing
anon_vmas.

But if it's already clear that they won't, it appears to me it would
be more efficient in the long run to just allocate our own anon_vma.

Did you have something in mind that I missed?

	Hannes

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 22:49                                                                                                           ` Johannes Weiner
@ 2010-04-10 23:31                                                                                                             ` Linus Torvalds
  0 siblings, 0 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-10 23:31 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Borislav Petkov, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson



On Sun, 11 Apr 2010, Johannes Weiner wrote:
> 
> Did you have something in mind that I missed?

Mostly that the corner cases will never matter, and I'd prefer to keep the 
code simpler than to care deeply.

For example, the only case you'd see vm_ops->close() is for special device 
mappings. It's true that they cannot have their vma's merged, but it's 
also true that they (a) will seldom have anon_vma's anyway and (b) would 
never get mapped very many times so that anon_vma merging would be an 
issue.

In other words, it's a "don't care" situation, where to keep the code 
simpler we just document that we don't care.

		Linus

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 21:51                                                                                                                               ` Borislav Petkov
@ 2010-04-11 13:08                                                                                                                                 ` Borislav Petkov
  2010-04-11 13:19                                                                                                                                   ` [PATCH 1/3] mm: make page freeing path RCU-safe Borislav Petkov
                                                                                                                                                     ` (4 more replies)
  0 siblings, 5 replies; 242+ messages in thread
From: Borislav Petkov @ 2010-04-11 13:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

From: Borislav Petkov <bp@alien8.de>
Date: Sat, Apr 10, 2010 at 11:51:15PM +0200

> > Damn, I really hoped that was it. Three independent bugs found and fixed, 
> > and still no joy? Oh well.
> 
> Yep, I'll redo the testing tomorrow, so that we are sure that even with
> the _three_ bugs fixed we still hit the funky list element issue.

Ok, I could verify that the three patches we were talking about still
don't fix the issue. However, just to make sure, I'm sending the versions
of the patches I used so that you guys can check them.

[  529.667108] PM: Preallocating image memory... 
[  529.930881] BUG: unable to handle kernel NULL pointer dereference at (null)
[  529.931275] IP: [<ffffffff810c603c>] page_referenced+0xee/0x1dc
[  529.931377] PGD 22e33d067 PUD 22ddc1067 PMD 0 
[  529.931377] Oops: 0000 [#1] PREEMPT SMP 
[  529.931377] last sysfs file: /sys/power/state
[  529.931377] CPU 3 
[  529.931377] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 ohci_hcd edac_core serial_core pcspkr k10temp
[  529.931377] 
[  529.931377] Pid: 3354, comm: hib.sh Tainted: G        W  2.6.34-rc3-00503-g0fcc334 #1 M3A78 PRO/System Product Name
[  529.931377] RIP: 0010:[<ffffffff810c603c>]  [<ffffffff810c603c>] page_referenced+0xee/0x1dc
[  529.931377] RSP: 0018:ffff880105a118b8  EFLAGS: 00010283
[  529.931377] RAX: ffff88022dc896c8 RBX: ffffea0007a15e10 RCX: 0000000000000000
[  529.931377] RDX: ffff880105a11cf8 RSI: ffff88022dc896a0 RDI: ffff88022b760000
[  529.931377] RBP: ffff880105a11938 R08: 0000000000000002 R09: 0000000000000000
[  529.931377] R10: 0000000000000000 R11: ffffffff00000012 R12: 0000000000000000
[  529.931377] R13: ffffffffffffffe0 R14: ffff88022dc89688 R15: ffff880105a11a00
[  529.931377] FS:  00007f21045876f0(0000) GS:ffff88000a600000(0000) knlGS:0000000000000000
[  529.931377] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  529.931377] CR2: 0000000000000000 CR3: 000000022b33f000 CR4: 00000000000006e0
[  529.931377] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  529.931377] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  529.931377] Process hib.sh (pid: 3354, threadinfo ffff880105a10000, task ffff88022b760000)
[  529.931377] Stack:
[  529.931377]  ffff88022dc896c8 00000000810b0082 0000000000000000 0000000000000000
[  529.931377] <0> 0000000000000000 0000000000000000 0000000000000000 0000000000000020
[  529.931377] <0> 0000000000000000 0000000200000000 7fffffffffffffff ffffea0007a15e38
[  529.931377] Call Trace:
[  529.931377]  [<ffffffff810b06bc>] shrink_page_list+0x154/0x4c7
[  529.931377]  [<ffffffff81067149>] ? print_lock_contention_bug+0x1b/0xe1
[  529.931377]  [<ffffffff810af59c>] ? isolate_pages_global+0xd0/0x1fc
[  529.931377]  [<ffffffff8140f9f6>] ? _raw_spin_unlock_irq+0x30/0x58
[  529.931377]  [<ffffffff810b0d8a>] shrink_inactive_list+0x35b/0x60c
[  529.931377]  [<ffffffff810b0556>] ? shrink_active_list+0x232/0x244
[  529.931377]  [<ffffffff810b1347>] shrink_zone+0x30c/0x3d6
[  529.931377]  [<ffffffff810b1f3d>] do_try_to_free_pages+0x191/0x29a
[  529.931377]  [<ffffffff810b20db>] shrink_all_memory+0x95/0xc4
[  529.931377]  [<ffffffff81078e1e>] ? memory_bm_test_bit+0x1/0x30
[  529.931377]  [<ffffffff810af4cc>] ? isolate_pages_global+0x0/0x1fc
[  529.931377]  [<ffffffff81079c9c>] ? count_data_pages+0x65/0x79
[  529.931377]  [<ffffffff81079f03>] hibernate_preallocate_memory+0x1aa/0x2cb
[  529.931377]  [<ffffffff8140bd74>] ? printk+0x41/0x45
[  529.931377]  [<ffffffff8107878f>] hibernation_snapshot+0x36/0x1e1
[  529.931377]  [<ffffffff81078a08>] hibernate+0xce/0x172
[  529.931377]  [<ffffffff81077775>] state_store+0x5c/0xd3
[  529.931377]  [<ffffffff8118f573>] kobj_attr_store+0x17/0x19
[  529.931377]  [<ffffffff8112e42c>] sysfs_write_file+0x108/0x144
[  529.931377]  [<ffffffff810db63b>] vfs_write+0xb2/0x153
[  529.931377]  [<ffffffff810663c9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[  529.931377]  [<ffffffff810db79f>] sys_write+0x4a/0x71
[  529.931377]  [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b
[  529.931377] Code: 3b 56 10 73 1e 48 83 fa f2 74 18 48 8d 4d cc 4d 89 f8 48 89 df e8 11 f2 ff ff 41 01 c4 83 7d cc 00 74 19 4d 8b 6d 20 49 83 ed 20 <49> 8b 45 20 0f 18 08 49 8d 45 20 48 39 45 80 75 aa 4c 89 f7 e8 
[  529.931377] RIP  [<ffffffff810c603c>] page_referenced+0xee/0x1dc
[  529.931377]  RSP <ffff880105a118b8>
[  529.931377] CR2: 0000000000000000
[  529.945250] ---[ end trace caa5471c993e6461 ]---
[  529.945558] note: hib.sh[3354] exited with preempt_count 2
[  529.945710] BUG: scheduling while atomic: hib.sh/3354/0x10000003
[  529.945858] INFO: lockdep is turned off.
[  529.946005] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 ohci_hcd edac_core serial_core pcspkr k10temp
[  529.947595] Pid: 3354, comm: hib.sh Tainted: G      D W  2.6.34-rc3-00503-g0fcc334 #1
[  529.947848] Call Trace:
[  529.947993]  [<ffffffff810658df>] ? __debug_show_held_locks+0x1b/0x24
[  529.948147]  [<ffffffff8102dfac>] __schedule_bug+0x72/0x77
[  529.948296]  [<ffffffff8140c188>] schedule+0xe3/0x7ff
[  529.948449]  [<ffffffff810bd0e4>] ? unmap_vmas+0x90c/0x911
[  529.948599]  [<ffffffff81030ecb>] __cond_resched+0x18/0x24
[  529.948748]  [<ffffffff8140c971>] _cond_resched+0x2c/0x37
[  529.948896]  [<ffffffff810bcef1>] unmap_vmas+0x719/0x911
[  529.949049]  [<ffffffff8140f01e>] ? _raw_spin_lock_irqsave+0x1e/0x85
[  529.949199]  [<ffffffff8105a878>] ? up+0x14/0x3e
[  529.949347]  [<ffffffff810c171f>] exit_mmap+0x102/0x1e4
[  529.949639]  [<ffffffff810c1686>] ? exit_mmap+0x69/0x1e4
[  529.949787]  [<ffffffff810368bc>] mmput+0x48/0xb9
[  529.949935]  [<ffffffff8103ad90>] exit_mm+0x110/0x11d
[  529.950087]  [<ffffffff8103c9e6>] do_exit+0x1c5/0x6e5
[  529.950236]  [<ffffffff81039e2f>] ? kmsg_dump+0x13b/0x155
[  529.950525]  [<ffffffff8100616b>] ? oops_end+0x47/0x93
[  529.950671]  [<ffffffff810061b2>] oops_end+0x8e/0x93
[  529.950819]  [<ffffffff8101f3e5>] no_context+0x1fc/0x20b
[  529.950967]  [<ffffffff8101f580>] __bad_area_nosemaphore+0x18c/0x1af
[  529.951120]  [<ffffffff8101f7bb>] ? do_page_fault+0xa8/0x32d
[  529.951276]  [<ffffffff8101f5b6>] bad_area_nosemaphore+0x13/0x15
[  529.951572]  [<ffffffff8101f886>] do_page_fault+0x173/0x32d
[  529.951719]  [<ffffffff81410363>] ? error_sti+0x5/0x6
[  529.951867]  [<ffffffff81065387>] ? trace_hardirqs_off_caller+0x1f/0xa9
[  529.952018]  [<ffffffff8140ec9b>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[  529.952170]  [<ffffffff8141017f>] page_fault+0x1f/0x30
[  529.952319]  [<ffffffff810c603c>] ? page_referenced+0xee/0x1dc
[  529.952615]  [<ffffffff810c5fce>] ? page_referenced+0x80/0x1dc
[  529.952762]  [<ffffffff810b06bc>] shrink_page_list+0x154/0x4c7
[  529.952911]  [<ffffffff81067149>] ? print_lock_contention_bug+0x1b/0xe1
[  529.953065]  [<ffffffff810af59c>] ? isolate_pages_global+0xd0/0x1fc
[  529.953214]  [<ffffffff8140f9f6>] ? _raw_spin_unlock_irq+0x30/0x58
[  529.953363]  [<ffffffff810b0d8a>] shrink_inactive_list+0x35b/0x60c
[  529.953627]  [<ffffffff810b0556>] ? shrink_active_list+0x232/0x244
[  529.953775]  [<ffffffff810b1347>] shrink_zone+0x30c/0x3d6
[  529.953924]  [<ffffffff810b1f3d>] do_try_to_free_pages+0x191/0x29a
[  529.954077]  [<ffffffff810b20db>] shrink_all_memory+0x95/0xc4
[  529.954226]  [<ffffffff81078e1e>] ? memory_bm_test_bit+0x1/0x30
[  529.954486]  [<ffffffff810af4cc>] ? isolate_pages_global+0x0/0x1fc
[  529.954632]  [<ffffffff81079c9c>] ? count_data_pages+0x65/0x79
[  529.954782]  [<ffffffff81079f03>] hibernate_preallocate_memory+0x1aa/0x2cb
[  529.954931]  [<ffffffff8140bd74>] ? printk+0x41/0x45
[  529.955083]  [<ffffffff8107878f>] hibernation_snapshot+0x36/0x1e1
[  529.955233]  [<ffffffff81078a08>] hibernate+0xce/0x172
[  529.955457]  [<ffffffff81077775>] state_store+0x5c/0xd3
[  529.955604]  [<ffffffff8118f573>] kobj_attr_store+0x17/0x19
[  529.955752]  [<ffffffff8112e42c>] sysfs_write_file+0x108/0x144
[  529.955900]  [<ffffffff810db63b>] vfs_write+0xb2/0x153
[  529.956053]  [<ffffffff810663c9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[  529.956202]  [<ffffffff810db79f>] sys_write+0x4a/0x71
[  529.956351]  [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b
[  537.634362] SysRq : HELP : loglevel(0-9) reBoot Crash show-all-locks(D) terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) thaw-filesystems(J) saK show-backtrace-all-active-cpus(L) show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z) 
[  538.129750] SysRq : Emergency Sync
[  538.130161] Emergency Sync complete
[  538.902386] SysRq : Emergency Remount R/O
[  539.328830] SysRq : Resetting

-- 
Regards/Gruss,
Boris.

* [PATCH 1/3] mm: make page freeing path RCU-safe
  2010-04-11 13:08                                                                                                                                 ` Borislav Petkov
@ 2010-04-11 13:19                                                                                                                                   ` Borislav Petkov
  2010-04-11 13:19                                                                                                                                   ` [PATCH 2/3] mm: cleanup find_mergeable_anon_vma complexity Borislav Petkov
                                                                                                                                                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 242+ messages in thread
From: Borislav Petkov @ 2010-04-11 13:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

From: Linus Torvalds <torvalds@linux-foundation.org>

On Sat, 10 Apr 2010, Linus Torvalds wrote:
> On Sat, 10 Apr 2010, Borislav Petkov wrote:
> >
> > And I got an oops again, this time the #GP from a couple of days ago.
>
> Oh damn. So the list corruption really does happen still.

Ho humm.

Maybe I'm crazy, but something started bothering me. And I started
wondering: when is the 'page->mapping' of an anonymous page actually
cleared?

The thing is, the mapping of an anonymous page is actually cleared only
when the page is _freed_, in "free_hot_cold_page()".

Now, let's think about that. And in particular, let's think about how that
relates to the freeing of the 'anon_vma' that the page->mapping points to.

The way the anon_vma is freed is when the mapping is torn down, and we do
roughly:

	tlb = tlb_gather_mmu(mm,..)
	..
	unmap_vmas(&tlb, vma ..
	..
	free_pgtables()
	..
	tlb_finish_mmu(tlb, start, end);

and we actually unmap all the pages in "unmap_vmas()", and then _after_
unmapping all the pages we do the "unlink_anon_vmas(vma);" in
"free_pgtables()". Fine so far - the anon_vma stays around until after the
page has been happily unmapped.

But "unmapped all the pages" is _not_ actually the same as "free'd all the
pages". The actual _freeing_ of the page happens generally in
tlb_finish_mmu(), because we can free the page only after we've flushed
any TLB entries.

So what we have in that tlb_gather structure is a list of _pending_ pages
to be freed, while we already actually free'd the anon_vmas earlier!
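
To make the ordering concrete, here is a toy userspace model of the window
described above. It is only a sketch under stated assumptions: the toy_*
types and step functions are hypothetical stand-ins, not the kernel's
mmu_gather code, and the "free" of the anon_vma is modelled with a flag so
the program never touches freed memory.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel objects discussed above. */
struct toy_anon_vma {
	int alive;			/* 0 once "freed" (modelled, no real free) */
};

struct toy_page {
	struct toy_anon_vma *mapping;	/* plays the role of page->mapping */
	int mapped;			/* still mapped into the address space? */
	int freed;			/* has the page itself been freed? */
};

/* Step 1, as in unmap_vmas(): the page is unmapped and queued, not freed. */
static void toy_unmap(struct toy_page *page)
{
	page->mapped = 0;
}

/* Step 2, as in free_pgtables(): the anon_vma is unlinked and released. */
static void toy_unlink_anon_vma(struct toy_anon_vma *av)
{
	av->alive = 0;
}

/* Step 3, as in tlb_finish_mmu(): the queued page is freed, mapping cleared. */
static void toy_finish(struct toy_page *page)
{
	page->mapping = NULL;
	page->freed = 1;
}

/* True when page->mapping still points at an already released anon_vma. */
static int toy_mapping_is_stale(const struct toy_page *page)
{
	return !page->freed && page->mapping && !page->mapping->alive;
}

/* Runs the three steps in order and reports whether the window was seen. */
static int toy_teardown_has_stale_window(void)
{
	struct toy_anon_vma av = { .alive = 1 };
	struct toy_page page = { .mapping = &av, .mapped = 1, .freed = 0 };
	int stale;

	toy_unmap(&page);
	assert(!toy_mapping_is_stale(&page));	/* anon_vma is still around */

	toy_unlink_anon_vma(&av);
	stale = toy_mapping_is_stale(&page);	/* the window in question */

	toy_finish(&page);
	assert(!toy_mapping_is_stale(&page));	/* window is closed again */

	return stale;
}
```

In this model the stale state exists only between step 2 and step 3, which
is the interval the mail is worried about.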

Now, the thing is, tlb_gather_mmu() begins a preempt-safe region (because
we use a per-cpu variable), but as far as I can tell it is _not_ an
RCU-safe region.

So I think we might actually get a real RCU freeing event while this all
happens. So now the 'anon_vma' that 'page->mapping' points to has not just
been released back to the SLUB caches, the page itself might have been
released too.

I dunno. Does the above sound at all sane? Or am I just raving?

Something hacky like the below might fix it if I'm not just raving. I
really might be missing something here.

		Linus
---
 include/asm-generic/tlb.h |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index e43f976..2678118 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -14,6 +14,7 @@
 #define _ASM_GENERIC__TLB_H
 
 #include <linux/swap.h>
+#include <linux/rcupdate.h>
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
 
@@ -62,6 +63,7 @@ tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
 
 	tlb->fullmm = full_mm_flush;
 
+	rcu_read_lock();
 	return tlb;
 }
 
@@ -90,6 +92,7 @@ tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
 
+	rcu_read_unlock();
 	put_cpu_var(mmu_gathers);
 }
 
-- 
1.7.0.3


* [PATCH 2/3] mm: cleanup find_mergeable_anon_vma complexity
  2010-04-11 13:08                                                                                                                                 ` Borislav Petkov
  2010-04-11 13:19                                                                                                                                   ` [PATCH 1/3] mm: make page freeing path RCU-safe Borislav Petkov
@ 2010-04-11 13:19                                                                                                                                   ` Borislav Petkov
  2010-04-11 13:19                                                                                                                                   ` [PATCH 3/3] mm: fixup vma_adjust Borislav Petkov
                                                                                                                                                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 242+ messages in thread
From: Borislav Petkov @ 2010-04-11 13:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

From: Linus Torvalds <torvalds@linux-foundation.org>

On Sat, 10 Apr 2010, Linus Torvalds wrote:
>
> But I think the fact that you are apparently not able to get the list
> corruption is a good sign. Of course, it might just be harder to trigger,
> and these things could all be a sign of a different bug, but my gut feel
> is that we did fix something, and you are just damn good at stressing the
> new code. Kudos.

Btw, I do hate the current 'find_mergeable_anon_vma()' with its duplicated
checks for prev/next compatibility that I just made even more complex.

So I'm actually inclined to want to write my simple two-liner fix as a
rather more complex cleanup patch, below.

It adds way more lines than it deletes, but a lot of it is comments (and
some of it is just because one routine got split up into three), and I
think it makes the result a lot more readable.

It also splits off the decision of whether we can reuse an anon_vma from
the decision of whether we can merge the vma's - the two are kind of
related, but they are not really the same, and they have different issues.
I think it's good to try to keep separate issues separate.
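
As a rough illustration of the flag part of that decision, the test used in
the patch, "flags may differ only in bits that mprotect can change", can be
tried out in userspace. This is a sketch, not the kernel code: the helper
name flags_compatible() is made up here, and the VM_* values are copied in
only so the example compiles on its own.

```c
#include <assert.h>

/* Bit values as in the kernel's linux/mm.h; repeated here for illustration. */
#define VM_READ   0x00000001UL
#define VM_WRITE  0x00000002UL
#define VM_EXEC   0x00000004UL
#define VM_SHARED 0x00000008UL

/*
 * Hypothetical helper: XOR leaves exactly the bits in which the two flag
 * words differ; masking out READ/WRITE/EXEC leaves the differences that
 * mprotect cannot paper over. No such difference means "compatible".
 */
static int flags_compatible(unsigned long a, unsigned long b)
{
	return !((a ^ b) & ~(VM_READ | VM_WRITE | VM_EXEC));
}

static void flags_compatible_examples(void)
{
	/* Differ only in protection bits: still compatible. */
	assert(flags_compatible(VM_READ, VM_READ | VM_WRITE));
	assert(flags_compatible(VM_READ | VM_SHARED,
				VM_WRITE | VM_EXEC | VM_SHARED));

	/* Differ in VM_SHARED: not compatible. */
	assert(!flags_compatible(VM_READ, VM_READ | VM_SHARED));
}
```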

This is UNTESTED! It's meant to be an "obvious cleanup" with no real
semantic difference, but if I did something wrong it won't work. Also note
the comment about the lack of locking between two adjacent anon_vma's
taking a page fault at the same time: the ACCESS_ONCE() is unlikely to
ever matter (anon_vma's are stable once they are set, so it's really just
that you could first load a NULL, and then if you re-load the value you
might get a non-NULL thing).
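
A minimal sketch of what the ACCESS_ONCE() buys, assuming GNU C for
__typeof__: the pointer is loaded exactly once into a local, and both the
NULL test and the return value use that local, so the two can never
disagree. The toy_vma type and toy_reusable() are hypothetical names for
this example only.

```c
#include <assert.h>
#include <stddef.h>

/* Same shape as the kernel macro; __typeof__ is a GNU C extension. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Hypothetical cut-down vma: only the field that matters here. */
struct toy_vma {
	void *anon_vma;		/* may be set concurrently by another fault */
};

static void *toy_reusable(struct toy_vma *old)
{
	/* One load; the compiler may not re-read old->anon_vma below. */
	void *anon_vma = ACCESS_ONCE(old->anon_vma);

	if (anon_vma)
		return anon_vma;	/* same value that was just tested */
	return NULL;
}
```

Without the single load, a compiler would be free to read the field once
for the test and again for the return, which is the NULL-then-non-NULL case
mentioned above.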

Also note that when checking whether the anon_vma is a singleton, we don't
hold any lock that protects the list we are checking. But
"list_is_singular()" is safe and won't oops even if the pointers in the
list are crap, because it only _compares_ the prev/next pointers, it
doesn't dereference them.
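
For reference, a userspace sketch of why that holds. The two helpers mirror
the shape of the ones in linux/list.h, but this is a standalone
re-implementation for illustration, and the garbage pointer values in the
example are arbitrary.

```c
#include <assert.h>
#include <stdint.h>

struct list_head {
	struct list_head *next, *prev;
};

static int list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* Only reads head->next and head->prev and compares them. */
static int list_is_singular(const struct list_head *head)
{
	return !list_empty(head) && (head->next == head->prev);
}

static void list_is_singular_examples(void)
{
	struct list_head head = { &head, &head };
	struct list_head node;
	struct list_head bogus;

	/* Empty list: not singular. */
	assert(!list_is_singular(&head));

	/* Exactly one entry: singular. */
	head.next = head.prev = &node;
	node.next = node.prev = &head;
	assert(list_is_singular(&head));

	/* Garbage pointers are compared but never followed. */
	bogus.next = (struct list_head *)(uintptr_t)0x1000;
	bogus.prev = (struct list_head *)(uintptr_t)0x2000;
	assert(!list_is_singular(&bogus));
}
```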

In short, what I'm saying is that there is a pretty subtle race in the
very very unlikely case that two anon_vma's get prepared concurrently, but
from a correctness standpoint it doesn't matter. We might sometimes - once
in a blue moon - reject an anon_vma that could in theory have been merged,
but that won't hurt.

Comments? Rik, Johannes?

			Linus
---
 mm/mmap.c |   86 ++++++++++++++++++++++++++++++++++++++++++++-----------------
 1 files changed, 62 insertions(+), 24 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 75557c6..acb023e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -825,6 +825,61 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 }
 
 /*
+ * Rough compatibility check to quickly see if it's even worth looking
+ * at sharing an anon_vma.
+ *
+ * They need to have the same vm_file, and the flags can only differ
+ * in things that mprotect may change.
+ *
+ * NOTE! The fact that we share an anon_vma doesn't _have_ to mean that
+ * we can merge the two vma's. For example, we refuse to merge a vma if
+ * there is a vm_ops->close() function, because that indicates that the
+ * driver is doing some kind of reference counting. But that doesn't
+ * really matter for the anon_vma sharing case.
+ */
+static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *b)
+{
+	return a->vm_end == b->vm_start &&
+		mpol_equal(vma_policy(a), vma_policy(b)) &&
+		a->vm_file == b->vm_file &&
+		!((a->vm_flags ^ b->vm_flags) & ~(VM_READ|VM_WRITE|VM_EXEC)) &&
+		b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT);
+}
+
+/*
+ * Do some basic sanity checking to see if we can re-use the anon_vma
+ * from 'old'. The 'a'/'b' vma's are in VM order - one of them will be
+ * the same as 'old', the other will be the new one that is trying
+ * to share the anon_vma.
+ *
+ * NOTE! This runs with mm_sem held for reading, so it is possible that
+ * the anon_vma of 'old' is concurrently in the process of being set up
+ * by another page fault trying to merge _that_. But that's ok: if it
+ * is being set up, that automatically means that it will be a singleton
+ * acceptable for merging, so we can do all of this optimistically. But
+ * we do that ACCESS_ONCE() to make sure that we never re-load the pointer.
+ *
+ * IOW: that the "list_is_singular()" test on the anon_vma_chain only
+ * matters for the 'stable anon_vma' case (ie the thing we want to avoid
+ * is to return an anon_vma that is "complex" due to having gone through
+ * a fork).
+ *
+ * We also make sure that the two vma's are compatible (adjacent,
+ * and with the same memory policies). That's all stable, even with just
+ * a read lock on the mm_sem.
+ */
+static struct anon_vma *reusable_anon_vma(struct vm_area_struct *old, struct vm_area_struct *a, struct vm_area_struct *b)
+{
+	if (anon_vma_compatible(a, b)) {
+		struct anon_vma *anon_vma = ACCESS_ONCE(old->anon_vma);
+
+		if (anon_vma && list_is_singular(&old->anon_vma_chain))
+			return anon_vma;
+	}
+	return NULL;
+}
+
+/*
  * find_mergeable_anon_vma is used by anon_vma_prepare, to check
  * neighbouring vmas for a suitable anon_vma, before it goes off
  * to allocate a new anon_vma.  It checks because a repetitive
@@ -834,28 +889,16 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
  */
 struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
 {
+	struct anon_vma *anon_vma;
 	struct vm_area_struct *near;
-	unsigned long vm_flags;
 
 	near = vma->vm_next;
 	if (!near)
 		goto try_prev;
 
-	/*
-	 * Since only mprotect tries to remerge vmas, match flags
-	 * which might be mprotected into each other later on.
-	 * Neither mlock nor madvise tries to remerge at present,
-	 * so leave their flags as obstructing a merge.
-	 */
-	vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
-	vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC);
-
-	if (near->anon_vma && vma->vm_end == near->vm_start &&
- 			mpol_equal(vma_policy(vma), vma_policy(near)) &&
-			can_vma_merge_before(near, vm_flags,
-				NULL, vma->vm_file, vma->vm_pgoff +
-				((vma->vm_end - vma->vm_start) >> PAGE_SHIFT)))
-		return near->anon_vma;
+	anon_vma = reusable_anon_vma(near, vma, near);
+	if (anon_vma)
+		return anon_vma;
 try_prev:
 	/*
 	 * It is potentially slow to have to call find_vma_prev here.
@@ -868,14 +911,9 @@ try_prev:
 	if (!near)
 		goto none;
 
-	vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
-	vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC);
-
-	if (near->anon_vma && near->vm_end == vma->vm_start &&
-  			mpol_equal(vma_policy(near), vma_policy(vma)) &&
-			can_vma_merge_after(near, vm_flags,
-				NULL, vma->vm_file, vma->vm_pgoff))
-		return near->anon_vma;
+	anon_vma = reusable_anon_vma(near, near, vma);
+	if (anon_vma)
+		return anon_vma;
 none:
 	/*
 	 * There's no absolute need to look only at touching neighbours:
-- 
1.7.0.3


^ permalink raw reply related	[flat|nested] 242+ messages in thread

* [PATCH 3/3] mm: fixup vma_adjust
  2010-04-11 13:08                                                                                                                                 ` Borislav Petkov
  2010-04-11 13:19                                                                                                                                   ` [PATCH 1/3] mm: make page freeing path RCU-safe Borislav Petkov
  2010-04-11 13:19                                                                                                                                   ` [PATCH 2/3] mm: cleanup find_mergeable_anon_vma complexity Borislav Petkov
@ 2010-04-11 13:19                                                                                                                                   ` Borislav Petkov
  2010-04-11 13:25                                                                                                                                   ` [PATCH 2/3] mm: cleanup find_mergeable_anon_vma complexity Borislav Petkov
  2010-04-11 17:07                                                                                                                                   ` [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA Linus Torvalds
  4 siblings, 0 replies; 242+ messages in thread
From: Borislav Petkov @ 2010-04-11 13:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

From: Linus Torvalds <torvalds@linux-foundation.org>

On Sat, 10 Apr 2010, Borislav Petkov wrote:
> From: Borislav Petkov <bp@alien8.de>
> Date: Sat, Apr 10, 2010 at 08:51:45PM +0200
>
> > Anyways, testing...
>
> Nope, still b0rked. And this time is not a funny pattern but
> ffffffffffffffe0 we had originally.

Ok, I think that just depends on who happens to re-use the allocation and
how it does it.

I'm pretty sure it's a use-after-free issue, where we have free'd an
anon_vma too early, even though it has pages associated with it.

If it wasn't the RCU case, it's just something else.

I think it's worth looking at "vma_adjust()", because as I already
mentioned to Rik earlier - the code is very hard to understand, and it's
accrued crud over many many years.

And vma_adjust is the one place that does that anon_vma_merge(), which,
apart from the actual unmapping sequence, is the only other place that
actually frees anon_vmas. So there are reasons to be very suspicious of
that code.

And I think that code can actually lose an anon_vma chain. It's totally
screwing up the "import anonvma" case: when it does

                        if (anon_vma_clone(importer, vma)) {
                                return -ENOMEM;
                        }
                        importer->anon_vma = anon_vma;

we can actually have "importer == vma", but "anon_vma = next->anon_vma".

In which case we actually end up with an _empty_ chain (because importer
didn't have a chain to begin with!) but "importer->anon_vma" points to an
anon_vma.

And then when we do that "remove_next", we actually get rid of the only
chain we ever had, and have lost all our references to the anon_vma.

That looks _horribly_ buggy.
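
[ A toy userspace model of the scenario described above, offered only as a
  sketch. The struct, the chain counter and every function name here are
  hypothetical stand-ins, not kernel code. It models the "vma expands over
  next" case where vma has no anon_vma but next does, and compares cloning
  from vma itself (the old behaviour) with cloning from the exporter. ]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a vma: just the two fields that matter here. */
struct vma_model {
	const void *anon_vma;	/* NULL if the vma has no anon_vma */
	int chain_len;		/* number of anon_vma_chain entries */
};

/* Models anon_vma_clone(): dst gains one chain entry per entry in src. */
static void clone_model(struct vma_model *dst, const struct vma_model *src)
{
	dst->chain_len += src->chain_len;
}

/* A vma is consistent if it has an anon_vma exactly when it has a chain. */
static int consistent(const struct vma_model *v)
{
	return (v->anon_vma != NULL) == (v->chain_len > 0);
}

/*
 * vma expands over next. With fixed == 0 the clone source is vma itself
 * (importer == vma), as in the code being criticised; with fixed != 0 the
 * clone source is the exporter. Returns whether the importer ends up
 * consistent.
 */
int expand_over_next(int fixed)
{
	static const int token = 0;	/* address used as a fake anon_vma */
	struct vma_model vma = { NULL, 0 };
	struct vma_model next = { &token, 1 };
	struct vma_model *importer = &vma;
	struct vma_model *exporter = &next;

	assert(consistent(&vma) && consistent(&next));

	if (exporter->anon_vma && !importer->anon_vma) {
		clone_model(importer, fixed ? exporter : &vma);
		importer->anon_vma = exporter->anon_vma;
	}
	return consistent(importer);
}
```

In this model expand_over_next(0) leaves the importer with an anon_vma
pointer but an empty chain, while expand_over_next(1) leaves it consistent.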

Also, the conditional nesting makes no sense (the whole anon_vma_clone()
only makes sense if importer is set, and it is only ever set _inside_ the
earlier if-statement, so the whole code should be moved inside there), nor
do some of the comments.

This patch is scary and untested, but the more I look at that code, the
more convinced I am that vma_adjust was _really_ badly screwed up. The
patch below may make things worse. I'll test it myself too, but I'm
sending it out first, since I was writing the email as I was looking at
the piece of cr*p.

		Linus
---
 mm/mmap.c |   24 ++++++++----------------
 1 files changed, 8 insertions(+), 16 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index acb023e..f90ea92 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -507,11 +507,12 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	struct address_space *mapping = NULL;
 	struct prio_tree_root *root = NULL;
 	struct file *file = vma->vm_file;
-	struct anon_vma *anon_vma = NULL;
 	long adjust_next = 0;
 	int remove_next = 0;
 
 	if (next && !insert) {
+		struct vm_area_struct *exporter = NULL;
+
 		if (end >= next->vm_end) {
 			/*
 			 * vma expands, overlapping all the next, and
@@ -519,7 +520,7 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
 			 */
 again:			remove_next = 1 + (end > next->vm_end);
 			end = next->vm_end;
-			anon_vma = next->anon_vma;
+			exporter = next;
 			importer = vma;
 		} else if (end > next->vm_start) {
 			/*
@@ -527,7 +528,7 @@ again:			remove_next = 1 + (end > next->vm_end);
 			 * mprotect case 5 shifting the boundary up.
 			 */
 			adjust_next = (end - next->vm_start) >> PAGE_SHIFT;
-			anon_vma = next->anon_vma;
+			exporter = next;
 			importer = vma;
 		} else if (end < vma->vm_end) {
 			/*
@@ -536,28 +537,19 @@ again:			remove_next = 1 + (end > next->vm_end);
 			 * mprotect case 4 shifting the boundary down.
 			 */
 			adjust_next = - ((vma->vm_end - end) >> PAGE_SHIFT);
-			anon_vma = next->anon_vma;
+			exporter = vma;
 			importer = next;
 		}
-	}
 
-	/*
-	 * When changing only vma->vm_end, we don't really need anon_vma lock.
-	 */
-	if (vma->anon_vma && (insert || importer || start != vma->vm_start))
-		anon_vma = vma->anon_vma;
-	if (anon_vma) {
 		/*
 		 * Easily overlooked: when mprotect shifts the boundary,
 		 * make sure the expanding vma has anon_vma set if the
 		 * shrinking vma had, to cover any anon pages imported.
 		 */
-		if (importer && !importer->anon_vma) {
-			/* Block reverse map lookups until things are set up. */
-			if (anon_vma_clone(importer, vma)) {
+		if (exporter && exporter->anon_vma && !importer->anon_vma) {
+			if (anon_vma_clone(importer, exporter))
 				return -ENOMEM;
-			}
-			importer->anon_vma = anon_vma;
+			importer->anon_vma = exporter->anon_vma;
 		}
 	}
 
-- 
1.7.0.3


^ permalink raw reply related	[flat|nested] 242+ messages in thread

* [PATCH 2/3] mm: cleanup find_mergeable_anon_vma complexity
  2010-04-11 13:08                                                                                                                                 ` Borislav Petkov
                                                                                                                                                     ` (2 preceding siblings ...)
  2010-04-11 13:19                                                                                                                                   ` [PATCH 3/3] mm: fixup vma_adjust Borislav Petkov
@ 2010-04-11 13:25                                                                                                                                   ` Borislav Petkov
  2010-04-11 17:07                                                                                                                                   ` [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA Linus Torvalds
  4 siblings, 0 replies; 242+ messages in thread
From: Borislav Petkov @ 2010-04-11 13:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

From: Linus Torvalds <torvalds@linux-foundation.org>

On Sat, 10 Apr 2010, Linus Torvalds wrote:
>
> But I think the fact that you are apparently not able to get the list
> corruption is a good sign. Of course, it might just be harder to trigger,
> and these things could all be a sign of a different bug, but my gut feel
> is that we did fix something, and you are just damn good at stressing the
> new code. Kudos.

Btw, I do hate the current 'find_mergeable_anon_vma()' with its duplicated
checks for prev/next compatibility that I just made even more complex.

So I'm actually inclined to want to write my simple two-liner fix as a
rather more complex cleanup patch, below.

It adds way more lines than it deletes, but a lot of it is comments (and
some of it is just because one routine got split up into three), and I
think it makes the result a lot more readable.

It also splits off the decision of whether we can reuse an anon_vma from
the decision of whether we can merge the vma's - the two are kind of
related, but they are not really the same, and they have different issues.
I think it's good to try to keep separate issues separate.

This is UNTESTED! It's meant to be an "obvious cleanup" with no real
semantic difference, but if I did something wrong it won't work. Also note
the comment about the lack of locking between two adjacent anon_vma's
taking a page fault at the same time: the ACCESS_ONCE() is unlikely to
ever matter (anon_vma's are stable once they are set, so it's really just
that you could first load a NULL, and then if you re-load the value you
might get a non-NULL thing).

Also note that when checking whether the anon_vma is a singleton, we don't
hold any lock that protects the list we are checking. But
"list_is_singular()" is safe and won't oops even if the pointers in the
list are crap, because it only _compares_ the prev/next pointers, it
doesn't dereference them.
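
[ A minimal userspace sketch of why the check is safe on garbage pointers.
  The list_head stand-in and the demo function are assumptions made for
  illustration; the two helpers follow the shape of the kernel's list.h
  versions, which only compare head->next and head->prev and never follow
  them. ]

```c
#include <assert.h>
#include <stdint.h>

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static int list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* Pointer comparisons only: head->next and head->prev are never followed. */
static int list_is_singular(const struct list_head *head)
{
	return !list_empty(head) && (head->next == head->prev);
}

/* Walks through three cases; returns 1 if each behaves as described. */
int singular_demo(void)
{
	struct list_head head, node;
	int ok = 1;

	/* Empty list: the head points at itself. */
	head.next = head.prev = &head;
	assert(list_empty(&head));
	ok &= !list_is_singular(&head);

	/* Exactly one entry on the list. */
	head.next = head.prev = &node;
	node.next = node.prev = &head;
	ok &= list_is_singular(&head);

	/* Garbage pointers: compared but never dereferenced, so no fault. */
	head.next = (struct list_head *)(uintptr_t)0xdead0000u;
	head.prev = (struct list_head *)(uintptr_t)0xbeef0000u;
	ok &= !list_is_singular(&head);

	return ok;
}
```

The third case is the interesting one: the worst outcome of racing with a
concurrent update is a wrong answer from the comparison, not an oops.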

In short, what I'm saying is that there is a pretty subtle race in the
very very unlikely case that two anon_vma's get prepared concurrently, but
from a correctness standpoint it doesn't matter. We might sometimes - once
in a blue moon - reject an anon_vma that could in theory have been merged,
but that won't hurt.

Comments? Rik, Johannes?

			Linus
---
 mm/mmap.c |   86 ++++++++++++++++++++++++++++++++++++++++++++-----------------
 1 files changed, 62 insertions(+), 24 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 75557c6..acb023e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -825,6 +825,61 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 }
 
 /*
+ * Rough compatibility check to quickly see if it's even worth looking
+ * at sharing an anon_vma.
+ *
+ * They need to have the same vm_file, and the flags can only differ
+ * in things that mprotect may change.
+ *
+ * NOTE! The fact that we share an anon_vma doesn't _have_ to mean that
+ * we can merge the two vma's. For example, we refuse to merge a vma if
+ * there is a vm_ops->close() function, because that indicates that the
+ * driver is doing some kind of reference counting. But that doesn't
+ * really matter for the anon_vma sharing case.
+ */
+static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *b)
+{
+	return a->vm_end == b->vm_start &&
+		mpol_equal(vma_policy(a), vma_policy(b)) &&
+		a->vm_file == b->vm_file &&
+		!((a->vm_flags ^ b->vm_flags) & ~(VM_READ|VM_WRITE|VM_EXEC)) &&
+		b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT);
+}
+
+/*
+ * Do some basic sanity checking to see if we can re-use the anon_vma
+ * from 'old'. The 'a'/'b' vma's are in VM order - one of them will be
+ * the same as 'old', the other will be the new one that is trying
+ * to share the anon_vma.
+ *
+ * NOTE! This runs with mm_sem held for reading, so it is possible that
+ * the anon_vma of 'old' is concurrently in the process of being set up
+ * by another page fault trying to merge _that_. But that's ok: if it
+ * is being set up, that automatically means that it will be a singleton
+ * acceptable for merging, so we can do all of this optimistically. But
+ * we do that ACCESS_ONCE() to make sure that we never re-load the pointer.
+ *
+ * IOW: that the "list_is_singular()" test on the anon_vma_chain only
+ * matters for the 'stable anon_vma' case (ie the thing we want to avoid
+ * is to return an anon_vma that is "complex" due to having gone through
+ * a fork).
+ *
+ * We also make sure that the two vma's are compatible (adjacent,
+ * and with the same memory policies). That's all stable, even with just
+ * a read lock on the mm_sem.
+ */
+static struct anon_vma *reusable_anon_vma(struct vm_area_struct *old, struct vm_area_struct *a, struct vm_area_struct *b)
+{
+	if (anon_vma_compatible(a, b)) {
+		struct anon_vma *anon_vma = ACCESS_ONCE(old->anon_vma);
+
+		if (anon_vma && list_is_singular(&old->anon_vma_chain))
+			return anon_vma;
+	}
+	return NULL;
+}
+
+/*
  * find_mergeable_anon_vma is used by anon_vma_prepare, to check
  * neighbouring vmas for a suitable anon_vma, before it goes off
  * to allocate a new anon_vma.  It checks because a repetitive
@@ -834,28 +889,16 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
  */
 struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
 {
+	struct anon_vma *anon_vma;
 	struct vm_area_struct *near;
-	unsigned long vm_flags;
 
 	near = vma->vm_next;
 	if (!near)
 		goto try_prev;
 
-	/*
-	 * Since only mprotect tries to remerge vmas, match flags
-	 * which might be mprotected into each other later on.
-	 * Neither mlock nor madvise tries to remerge at present,
-	 * so leave their flags as obstructing a merge.
-	 */
-	vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
-	vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC);
-
-	if (near->anon_vma && vma->vm_end == near->vm_start &&
- 			mpol_equal(vma_policy(vma), vma_policy(near)) &&
-			can_vma_merge_before(near, vm_flags,
-				NULL, vma->vm_file, vma->vm_pgoff +
-				((vma->vm_end - vma->vm_start) >> PAGE_SHIFT)))
-		return near->anon_vma;
+	anon_vma = reusable_anon_vma(near, vma, near);
+	if (anon_vma)
+		return anon_vma;
 try_prev:
 	/*
 	 * It is potentially slow to have to call find_vma_prev here.
@@ -868,14 +911,9 @@ try_prev:
 	if (!near)
 		goto none;
 
-	vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
-	vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC);
-
-	if (near->anon_vma && near->vm_end == vma->vm_start &&
-  			mpol_equal(vma_policy(near), vma_policy(vma)) &&
-			can_vma_merge_after(near, vm_flags,
-				NULL, vma->vm_file, vma->vm_pgoff))
-		return near->anon_vma;
+	anon_vma = reusable_anon_vma(near, near, vma);
+	if (anon_vma)
+		return anon_vma;
 none:
 	/*
 	 * There's no absolute need to look only at touching neighbours:
-- 
1.7.0.3


^ permalink raw reply related	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-11 13:08                                                                                                                                 ` Borislav Petkov
                                                                                                                                                     ` (3 preceding siblings ...)
  2010-04-11 13:25                                                                                                                                   ` [PATCH 2/3] mm: cleanup find_mergeable_anon_vma complexity Borislav Petkov
@ 2010-04-11 17:07                                                                                                                                   ` Linus Torvalds
  2010-04-11 17:16                                                                                                                                     ` Linus Torvalds
  4 siblings, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-11 17:07 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson



On Sun, 11 Apr 2010, Borislav Petkov wrote:
> 
> Ok, I could verify that the three patches we were talking about still
> can't fix the issue. However, just to make sure I'm sending the versions
> of the patches I used for you guys to check.

Yup, the patches are the ones I wanted you to try.

So either my fixes were buggy (possible, especially for the vma_adjust 
case), or there are other bugs still lurking.

The scary part is that the _old_ anon_vma code didn't really care about 
the anon_vma all that deeply. It was just a placeholder, if you got some 
of it wrong the worst that would probably happen would be that a page 
could never find all the mappings it had. So it was a possible swap 
efficiency problem when we cannot get rid of all mapped pages, but if it 
only happens for some small and unusual special case, nobody would ever 
have noticed.

With the new code, when you have a page that is associated with a stale 
anon_vma, you get the page_referenced() oops instead.

And I can't find the bug. Everything I've looked at looks fine. So I'm 
going to ask you to start applying "validation patches" - code to check 
some internal consistency, and seeing if we break that internal 
consistency somewhere.

It may be that Rik has some patches like this from his development work, 
but here's the first one. This patch should have caught the vma_adjust() 
problem, but all it caught for me was that "anon_vma_clone()" ended up 
cloning the avc entries in the wrong order so the lists didn't actually 
look exactly the same.

The patch fixes that case, so if this triggers any warnings for you, I 
think it's a real bug.
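
[ A small userspace sketch of the ordering issue, under the assumption that
  each cloned entry is linked at the head of the destination chain (as
  list_add() would do). The node type, the pool and the function names are
  hypothetical. Walking the source forward while inserting at the head
  reverses the copy; walking it in reverse preserves the order, so the
  first entry still matches. ]

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Hypothetical stand-in for struct anon_vma_chain. */
struct node {
	int id;			/* stands in for the entry's anon_vma */
	struct list_head link;	/* stands in for same_vma */
};

#define node_of(p) ((struct node *)((char *)(p) - offsetof(struct node, link)))

static void list_init(struct list_head *h)
{
	h->next = h->prev = h;
}

/* Head insertion, the same linking order as the kernel's list_add(). */
static void list_add_head(struct list_head *new, struct list_head *head)
{
	new->next = head->next;
	new->prev = head;
	head->next->prev = new;
	head->next = new;
}

/* Copy src into dst, taking nodes from pool, walking src in either order. */
static void clone_chain(struct list_head *dst, struct list_head *src,
			struct node *pool, int reverse)
{
	struct list_head *p = reverse ? src->prev : src->next;
	int n = 0;

	list_init(dst);
	while (p != src) {
		pool[n].id = node_of(p)->id;
		list_add_head(&pool[n].link, dst);
		n++;
		p = reverse ? p->prev : p->next;
	}
}

/* Builds a chain reading 1, 2, 3 and returns the first id of its clone. */
int clone_first_id(int reverse)
{
	struct node src_nodes[3], pool[3];
	struct list_head src, dst;
	int i;

	list_init(&src);
	for (i = 0; i < 3; i++) {
		src_nodes[i].id = 3 - i;	/* head-insert 3, then 2, then 1 */
		list_add_head(&src_nodes[i].link, &src);
	}
	assert(node_of(src.next)->id == 1);

	clone_chain(&dst, &src, pool, reverse);
	return node_of(dst.next)->id;
}
```

With a forward walk the clone starts with id 3 rather than 1, which is the
kind of mismatch a "first entry must match" check would flag; with a
reverse walk the clone starts with id 1, like the source.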

But I'm pretty sure that the problem is that we have a "page->mapping" 
that points to an anon_vma that no longer exists, and you can easily get 
that while still having valid vma chains - they just aren't necessarily 
the complete _set_ of chains they should be.

[ In particular, I think that the _real_ problem is that we don't clear 
  "page->mapping" when we unmap a page.

  See the comment at the end of page_remove_rmap(), and it also explains 
  the test for "page_mapped()" in page_lock_anon_vma().

  But I think the bug you see might be exactly the race between 
  page_mapped() and actually getting the anon_vma spinlock. I'd have 
  expected that window to be too small to ever hit, though, which is why I 
  find it a bit unlikely. But it would explain why you _sometimes_ 
  actually get a hung spinlock too - you never get the spinlock at all, 
  and somebody replaced the data with something that the spinlock code 
  thinks is a locked spinlock - but is no longer a spinlock at all ]

			Linus

---
 mm/mmap.c |   18 ++++++++++++++++++
 mm/rmap.c |    2 +-
 2 files changed, 19 insertions(+), 1 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..890c169 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1565,6 +1565,22 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 
 EXPORT_SYMBOL(get_unmapped_area);
 
+static void verify_vma(struct vm_area_struct *vma)
+{
+	if (vma->anon_vma) {
+		struct anon_vma_chain *avc;
+		if (WARN_ONCE(list_empty(&vma->anon_vma_chain), "vma has anon_vma but empty chain"))
+			return;
+		/* The first entry of the avc chain should match! */
+		avc = list_entry(vma->anon_vma_chain.next, struct anon_vma_chain, same_vma);
+		WARN_ONCE(avc->anon_vma != vma->anon_vma, "anon_vma entry doesn't match anon_vma_chain");
+		WARN_ONCE(avc->vma != vma, "vma entry doesn't match anon_vma_chain");
+	} else {
+		WARN_ONCE(!list_empty(&vma->anon_vma_chain), "vma has no anon_vma but has chain");
+	}
+}
+
+
 /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
 struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
 {
@@ -1598,6 +1614,8 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
 				mm->mmap_cache = vma;
 		}
 	}
+	if (vma)
+		verify_vma(vma);
 	return vma;
 }
 
diff --git a/mm/rmap.c b/mm/rmap.c
index eaa7a09..ee97d38 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -182,7 +182,7 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 {
 	struct anon_vma_chain *avc, *pavc;
 
-	list_for_each_entry(pavc, &src->anon_vma_chain, same_vma) {
+	list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
 		avc = anon_vma_chain_alloc();
 		if (!avc)
 			goto enomem_failure;

^ permalink raw reply related	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-11 17:07                                                                                                                                   ` [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA Linus Torvalds
@ 2010-04-11 17:16                                                                                                                                     ` Linus Torvalds
  2010-04-11 18:55                                                                                                                                       ` Borislav Petkov
                                                                                                                                                         ` (2 more replies)
  0 siblings, 3 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-11 17:16 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson



On Sun, 11 Apr 2010, Linus Torvalds wrote:
> 
>   But I think the bug you see might be exactly the race between 
>   page_mapped() and actually getting the anon_vma spinlock. I'd have 
>   expected that window to be too small to ever hit, though, which is why I 
>   find it a bit unlikely. But it would explain why you _sometimes_ 
>   actually get a hung spinlock too - you never get the spinlock at all, 
>   and somebody replaced the data with something that the spinlock code 
>   thinks is a locked spinlock - but is no longer a spinlock at all ]

Actually, so if it's that race, then we might get rid of the oops with 
this total hack.

NOTE! If this is the race, then the hack really is just a hack, because it 
doesn't really solve anything. We still take the spinlock, and if bad 
things have happened, _that_ can still very much fail, and you get the 
watchdog lockup message instead. So this doesn't really fix anything.

But if this patch changes behavior, and you no longer see the oops, that 
tells us _something_. I'm not sure how useful that "something" is, but it 
at least means that there are no _mapped_ pages that have that stale 
anon_vma pointer in page->mapping.

Conversely, if you still see the oops (rather than the watchdog), that 
means that we actually have pages that are still marked mapped, and that 
despite that mapped state have a stale page->mapping pointer. I actually 
find that the more likely case, because otherwise the window is _so_ small 
that I don't see how you can hit the oops so reliably.

Anyway - probably worth testing, along with the verify_vma() patch. If 
nothing else, if there is no new behavior, even that tells us something. 
Even if that "something" is not a huge piece of information.

		Linus

---
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -302,7 +302,11 @@ struct anon_vma *page_lock_anon_vma(struct page *page)
 
 	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
 	spin_lock(&anon_vma->lock);
-	return anon_vma;
+
+	if (page_mapped(page))
+		return anon_vma;
+
+	spin_unlock(&anon_vma->lock);
 out:
 	rcu_read_unlock();
 	return NULL;

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-11 17:16                                                                                                                                     ` Linus Torvalds
@ 2010-04-11 18:55                                                                                                                                       ` Borislav Petkov
  2010-04-12  0:13                                                                                                                                         ` Linus Torvalds
  2010-04-11 19:49                                                                                                                                       ` [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA Rik van Riel
  2010-04-11 21:45                                                                                                                                       ` Rik van Riel
  2 siblings, 1 reply; 242+ messages in thread
From: Borislav Petkov @ 2010-04-11 18:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sun, Apr 11, 2010 at 10:16:10AM -0700

> Conversely, if you still see the oops (rather than the watchdog), that 
> means that we actually have pages that are still marked mapped, and that 
> despite that mapped state have a stale page->mapping pointer. I actually 
> find that the more likely case, because otherwise the window is _so_ small 
> that I don't see how you can hit the oops so reliably.

Ok, I did test with all 5 patches applied. It oopsed with the same
trace, see below. Apart from one kernel/sched.c:3555 warning (the check
for the spinlock count overflowing), nothing else showed up. :(

I tried to see whether the page->mapping pointer is stale, I dunno,
maybe there could be something in the register dump which could tell us
what's happening. This is how I see it, I could very well be wrong and
missing something though:


So, yes, we oops at the same place; however, a bit earlier we do

	anon_vma = page_lock_anon_vma(page);
	if (!anon_vma)
		return referenced;

which compiles here to

	.loc 1 496 0
	movq	%rbx, %rdi	# page,
	call	page_lock_anon_vma	#
.LVL288:
	.loc 1 497 0
	testq	%rax, %rax	# anon_vma
.LVL289:
	.loc 1 496 0
	movq	%rax, %r14	#, anon_vma

and I checked that on the path before the instruction where we oops we
don't touch %r14, so the value in the register dump below should be that
anon_vma, which looks like a valid kernel pointer. We dereference it later
to get anon_vma->head.next with

	.loc 1 501 0
	movq	64(%r14), %r13	# <variable>.head.next, <variable>.head.next
.LBE1287:
	leaq	64(%r14), %rax	#,
	movq	%rax, -128(%rbp)	#, %sfp
.LBB1288:
	subq	$32, %r13	#, avc

which ends up in %r13 as ffffffffffffffe0.

So, it really looks like at least that list_head in anon_vma is
bollocks, or even the whole anon_vma. So if this is correct, it is
highly likely that the anon_vma is already freed material or not
initialized at all.
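
[ The arithmetic behind that ffffffffffffffe0 can be sketched in userspace.
  The struct layout below is an assumption chosen so that the list member
  sits at offset 32, matching the "subq $32, %r13" above; the type and
  function names are hypothetical. If the head.next read back from the
  anon_vma is NULL, the list-entry conversion yields 0 - 32, and the next
  load at offset 32 from that value lands on address 0, which would agree
  with CR2 in the dump below. ]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit list pointers modelled as plain integers for a fixed layout. */
struct list64 {
	uint64_t next, prev;
};

/* Hypothetical layout: two pointers, one list, then the list of interest. */
struct avc_model {
	uint64_t vma;
	uint64_t anon_vma;
	struct list64 same_vma;
	struct list64 same_anon_vma;	/* lands at offset 32 */
};

/* What the list-entry conversion does: subtract the member offset. */
uint64_t avc_from_list_next(uint64_t next)
{
	return next - offsetof(struct avc_model, same_anon_vma);
}

/* Address touched when reading avc->same_anon_vma.next. */
uint64_t next_load_address(uint64_t avc)
{
	return avc + offsetof(struct avc_model, same_anon_vma)
		   + offsetof(struct list64, next);
}

/* Address that faults if anon_vma->head.next was read back as NULL. */
uint64_t fault_address_for_null_next(void)
{
	assert(offsetof(struct avc_model, same_anon_vma) == 32);
	return next_load_address(avc_from_list_next(0));
}
```

The marked bytes in the Code: line of the oops, 49 8b 45 20, decode as a
load from 0x20(%r13), which is the second step modelled here.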

Hm...


[  616.317201] Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.
[  616.329964] PM: Preallocating image memory... 
[  616.586463] BUG: unable to handle kernel NULL pointer dereference at (null)
[  616.586851] IP: [<ffffffff810c614f>] page_referenced+0xee/0x1dc
[  616.587045] PGD 225dcf067 PUD 22627f067 PMD 0 
[  616.587126] Oops: 0000 [#1] PREEMPT SMP 
[  616.587126] last sysfs file: /sys/power/state
[  616.587126] CPU 1 
[  616.587126] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd edac_core 8250_pnp 8250 serial_core pcspkr k10temp
[  616.587126] 
[  616.587126] Pid: 3453, comm: hib.sh Tainted: G        W  2.6.34-rc3-00505-g1d9bb34 #1 M3A78 PRO/System Product Name
[  616.587126] RIP: 0010:[<ffffffff810c614f>]  [<ffffffff810c614f>] page_referenced+0xee/0x1dc
[  616.587126] RSP: 0018:ffff88022b3258b8  EFLAGS: 00010283
[  616.587126] RAX: ffff880200ba4b88 RBX: ffffea00076b2b30 RCX: ffff88022eacaa58
[  616.587126] RDX: ffffffff810c5e7a RSI: ffff880200ba4b60 RDI: ffff88022fa492e0
[  616.587126] RBP: ffff88022b325938 R08: 0000000000000002 R09: 0000000000000000
[  616.587126] R10: ffff88022eacaa30 R11: 0000000000000001 R12: 0000000000000000
[  616.587126] R13: ffffffffffffffe0 R14: ffff880200ba4b48 R15: ffff88022b325a00
[  616.587126] FS:  00007f0b140306f0(0000) GS:ffff88000a200000(0000) knlGS:0000000000000000
[  616.587126] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  616.587126] CR2: 0000000000000000 CR3: 000000022c44f000 CR4: 00000000000006e0
[  616.587126] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  616.587126] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  616.587126] Process hib.sh (pid: 3453, threadinfo ffff88022b324000, task ffff88022fa492e0)
[  616.587126] Stack:
[  616.587126]  ffff880200ba4b88 00000000810c5e5f ffff88022b325918 ffffffff810c5fd7
[  616.587126] <0> ffff880200000000 ffffffff00000001 ffff88022b325fd8 ffffea00076c1a80
[  616.587126] <0> ffffea00076c1a80 000000022b325cf8 ffffea00076c1a80 ffffea00076b2b58
[  616.587126] Call Trace:
[  616.587126]  [<ffffffff810c5fd7>] ? try_to_unmap_anon+0xa2/0xb4
[  616.587126]  [<ffffffff810b06bc>] shrink_page_list+0x154/0x4c7
[  616.587126]  [<ffffffff81067149>] ? print_lock_contention_bug+0x1b/0xe1
[  616.587126]  [<ffffffff810af59c>] ? isolate_pages_global+0xd0/0x1fc
[  616.587126]  [<ffffffff8140fb06>] ? _raw_spin_unlock_irq+0x30/0x58
[  616.587126]  [<ffffffff810b0d8a>] shrink_inactive_list+0x35b/0x60c
[  616.587126]  [<ffffffff810b0556>] ? shrink_active_list+0x232/0x244
[  616.587126]  [<ffffffff810b1347>] shrink_zone+0x30c/0x3d6
[  616.587126]  [<ffffffff810b1f3d>] do_try_to_free_pages+0x191/0x29a
[  616.587126]  [<ffffffff810b20db>] shrink_all_memory+0x95/0xc4
[  616.587126]  [<ffffffff810af4cc>] ? isolate_pages_global+0x0/0x1fc
[  616.587126]  [<ffffffff81079c9c>] ? count_data_pages+0x65/0x79
[  616.587126]  [<ffffffff81079f03>] hibernate_preallocate_memory+0x1aa/0x2cb
[  616.587126]  [<ffffffff8140be84>] ? printk+0x41/0x45
[  616.587126]  [<ffffffff8107878f>] hibernation_snapshot+0x36/0x1e1
[  616.587126]  [<ffffffff81078a08>] hibernate+0xce/0x172
[  616.587126]  [<ffffffff81077775>] state_store+0x5c/0xd3
[  616.587126]  [<ffffffff8118f687>] kobj_attr_store+0x17/0x19
[  616.587126]  [<ffffffff8112e540>] sysfs_write_file+0x108/0x144
[  616.587126]  [<ffffffff810db74f>] vfs_write+0xb2/0x153
[  616.587126]  [<ffffffff810663c9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[  616.587126]  [<ffffffff810db8b3>] sys_write+0x4a/0x71
[  616.587126]  [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b
[  616.587126] Code: 3b 56 10 73 1e 48 83 fa f2 74 18 48 8d 4d cc 4d 89 f8 48 89 df e8 02 f2 ff ff 41 01 c4 83 7d cc 00 74 19 4d 8b 6d 20 49 83 ed 20 <49> 8b 45 20 0f 18 08 49 8d 45 20 48 39 45 80 75 aa 4c 89 f7 e8 
[  616.587126] RIP  [<ffffffff810c614f>] page_referenced+0xee/0x1dc
[  616.587126]  RSP <ffff88022b3258b8>
[  616.587126] CR2: 0000000000000000
[  616.600838] ---[ end trace 0ea0c6b4ead21c8f ]---
[  616.600984] note: hib.sh[3453] exited with preempt_count 2
[  616.601282] BUG: scheduling while atomic: hib.sh/3453/0x10000003
[  616.601431] INFO: lockdep is turned off.
[  616.601584] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd edac_core 8250_pnp 8250 serial_core pcspkr k10temp
[  616.603115] Pid: 3453, comm: hib.sh Tainted: G      D W  2.6.34-rc3-00505-g1d9bb34 #1
[  616.603460] Call Trace:
[  616.603605]  [<ffffffff810658df>] ? __debug_show_held_locks+0x1b/0x24
[  616.603755]  [<ffffffff8102dfac>] __schedule_bug+0x72/0x77
[  616.603903]  [<ffffffff8140c298>] schedule+0xe3/0x7ff
[  616.604051]  [<ffffffff810bd0e4>] ? unmap_vmas+0x90c/0x911
[  616.604230]  [<ffffffff81030ecb>] __cond_resched+0x18/0x24
[  616.604381]  [<ffffffff8140ca81>] _cond_resched+0x2c/0x37
[  616.604529]  [<ffffffff810bcef1>] unmap_vmas+0x719/0x911
[  616.604678]  [<ffffffff810c16c0>] exit_mmap+0x102/0x1e4
[  616.604826]  [<ffffffff810c1627>] ? exit_mmap+0x69/0x1e4
[  616.604975]  [<ffffffff810368bc>] mmput+0x48/0xb9
[  616.605124]  [<ffffffff8103ad90>] exit_mm+0x110/0x11d
[  616.605280]  [<ffffffff8103c9e6>] do_exit+0x1c5/0x6e5
[  616.605430]  [<ffffffff81039e2f>] ? kmsg_dump+0x13b/0x155
[  616.605579]  [<ffffffff8100616b>] ? oops_end+0x47/0x93
[  616.605727]  [<ffffffff810061b2>] oops_end+0x8e/0x93
[  616.605875]  [<ffffffff8101f3e5>] no_context+0x1fc/0x20b
[  616.606023]  [<ffffffff8101f580>] __bad_area_nosemaphore+0x18c/0x1af
[  616.606176]  [<ffffffff8101f7bb>] ? do_page_fault+0xa8/0x32d
[  616.606330]  [<ffffffff8101f5b6>] bad_area_nosemaphore+0x13/0x15
[  616.606479]  [<ffffffff8101f886>] do_page_fault+0x173/0x32d
[  616.606628]  [<ffffffff81410463>] ? error_sti+0x5/0x6
[  616.606776]  [<ffffffff81065387>] ? trace_hardirqs_off_caller+0x1f/0xa9
[  616.606926]  [<ffffffff8140edab>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[  616.607076]  [<ffffffff8141027f>] page_fault+0x1f/0x30
[  616.607227]  [<ffffffff810c5e7a>] ? page_lock_anon_vma+0x0/0xbb
[  616.607381]  [<ffffffff810c614f>] ? page_referenced+0xee/0x1dc
[  616.607530]  [<ffffffff810c60e1>] ? page_referenced+0x80/0x1dc
[  616.607678]  [<ffffffff810c5fd7>] ? try_to_unmap_anon+0xa2/0xb4
[  616.607827]  [<ffffffff810b06bc>] shrink_page_list+0x154/0x4c7
[  616.607976]  [<ffffffff81067149>] ? print_lock_contention_bug+0x1b/0xe1
[  616.608131]  [<ffffffff810af59c>] ? isolate_pages_global+0xd0/0x1fc
[  616.608284]  [<ffffffff8140fb06>] ? _raw_spin_unlock_irq+0x30/0x58
[  616.608435]  [<ffffffff810b0d8a>] shrink_inactive_list+0x35b/0x60c
[  616.608585]  [<ffffffff810b0556>] ? shrink_active_list+0x232/0x244
[  616.608734]  [<ffffffff810b1347>] shrink_zone+0x30c/0x3d6
[  616.608883]  [<ffffffff810b1f3d>] do_try_to_free_pages+0x191/0x29a
[  616.609031]  [<ffffffff810b20db>] shrink_all_memory+0x95/0xc4
[  616.609183]  [<ffffffff810af4cc>] ? isolate_pages_global+0x0/0x1fc
[  616.609337]  [<ffffffff81079c9c>] ? count_data_pages+0x65/0x79
[  616.609486]  [<ffffffff81079f03>] hibernate_preallocate_memory+0x1aa/0x2cb
[  616.609636]  [<ffffffff8140be84>] ? printk+0x41/0x45
[  616.609784]  [<ffffffff8107878f>] hibernation_snapshot+0x36/0x1e1
[  616.609933]  [<ffffffff81078a08>] hibernate+0xce/0x172
[  616.610080]  [<ffffffff81077775>] state_store+0x5c/0xd3
[  616.610233]  [<ffffffff8118f687>] kobj_attr_store+0x17/0x19
[  616.610383]  [<ffffffff8112e540>] sysfs_write_file+0x108/0x144
[  616.610532]  [<ffffffff810db74f>] vfs_write+0xb2/0x153
[  616.610680]  [<ffffffff810663c9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[  616.610830]  [<ffffffff810db8b3>] sys_write+0x4a/0x71
[  616.610978]  [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b
[  682.501863] SysRq : HELP : loglevel(0-9) reBoot Crash show-all-locks(D) terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) thaw-filesystems(J) saK show-backtrace-all-active-cpus(L) show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z) 
[  683.552767] SysRq : Emergency Sync
[  683.553147] Emergency Sync complete
[  684.180708] SysRq : Emergency Remount R/O
[  684.927560] SysRq : Resetting

-- 
Regards/Gruss,
    Boris.


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-11 17:16                                                                                                                                     ` Linus Torvalds
  2010-04-11 18:55                                                                                                                                       ` Borislav Petkov
@ 2010-04-11 19:49                                                                                                                                       ` Rik van Riel
  2010-04-12 15:44                                                                                                                                         ` Linus Torvalds
  2010-04-11 21:45                                                                                                                                       ` Rik van Riel
  2 siblings, 1 reply; 242+ messages in thread
From: Rik van Riel @ 2010-04-11 19:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On 04/11/2010 01:16 PM, Linus Torvalds wrote:

> NOTE! If this is the race, then the hack really is just a hack, because it
> doesn't really solve anything. We still take the spinlock, and if bad
> things have happened, _that_ can still very much fail, and you get the
> watchdog lockup message instead. So this doesn't really fix anything.

Looking around the code some more, zap_pte_range()
calls page_remove_rmap(), which leaves the
page->mapping in place and has this comment:

         /*
          * It would be tidy to reset the PageAnon mapping here,
          * but that might overwrite a racing page_add_anon_rmap
          * which increments mapcount after us but sets mapping
          * before us: so leave the reset to free_hot_cold_page,
          * and remember that it's only reliable while mapped.
          * Leaving it set also helps swapoff to reinstate ptes
          * faster for those pages still in swapcache.
          */

I wonder if we can clear page->mapping here, if
list_is_singular(anon_vma->head).  That way we
will not leave stale pointers behind.

Adding another VMA to the anon_vma can happen
at fork time - which will not happen simultaneously
with exit or munmap, because the mmap_sem is taken
for write during either code path.

Am I overlooking something here?
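
A minimal userspace sketch of the check proposed above, for illustration only. It assumes a hand-rolled circular list that mimics the kernel's struct list_head, and every name ending in _sketch is hypothetical. It does no locking, so it says nothing about the page_add_anon_rmap race that the quoted comment warns about; note also that the kernel helper takes a pointer, i.e. list_is_singular(&anon_vma->head).

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct list_head (not the real header). */
struct list_head {
	struct list_head *next, *prev;
};

/* An empty circular list points at itself in both directions. */
static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert entry just before head, i.e. at the tail of the list. */
static void list_add_tail_sketch(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* True when the list holds exactly one entry (same test as the kernel helper). */
static bool list_is_singular_sketch(const struct list_head *head)
{
	return head->next != head && head->next == head->prev;
}

/* Hypothetical, stripped-down stand-ins for struct anon_vma and struct page. */
struct anon_vma_sketch {
	struct list_head head;	/* list of VMAs sharing this anon_vma */
};

struct page_sketch {
	void *mapping;
};

/*
 * Clear page->mapping only when the anon_vma has exactly one VMA attached,
 * so that no stale pointer is left behind. Returns true if it was cleared.
 */
static bool clear_mapping_if_last_vma(struct page_sketch *page,
				      struct anon_vma_sketch *anon_vma)
{
	if (!list_is_singular_sketch(&anon_vma->head))
		return false;
	page->mapping = NULL;
	return true;
}
```

With two VMAs on the list the mapping is left alone; with exactly one it is reset to NULL.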


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-11 17:16                                                                                                                                     ` Linus Torvalds
  2010-04-11 18:55                                                                                                                                       ` Borislav Petkov
  2010-04-11 19:49                                                                                                                                       ` [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA Rik van Riel
@ 2010-04-11 21:45                                                                                                                                       ` Rik van Riel
  2010-04-12 15:51                                                                                                                                         ` Linus Torvalds
  2 siblings, 1 reply; 242+ messages in thread
From: Rik van Riel @ 2010-04-11 21:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On 04/11/2010 01:16 PM, Linus Torvalds wrote:

> Actually, so if it's that race, then we might get rid of the oops with
> this total hack.

Another thing I just thought of.

The anon_vma struct will not be reused for something completely
different due to the SLAB_DESTROY_BY_RCU flag that the anon_vma_cachep
is created with.

The anon_vma_chain structs are allocated from a slab without that
flag, so they can be reused for something else in the middle of
an RCU section.

Is that something worth fixing, or is this so subtle that we'd
rather not have the code rely on this kind of behaviour at all?


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-11 18:55                                                                                                                                       ` Borislav Petkov
@ 2010-04-12  0:13                                                                                                                                         ` Linus Torvalds
  2010-04-12  1:04                                                                                                                                           ` Linus Torvalds
  0 siblings, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-12  0:13 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson



On Sun, 11 Apr 2010, Borislav Petkov wrote:
> 
> > Conversely, if you still see the oops (rather than the watchdog), that 
> > means that we actually have pages that are still marked mapped, and that 
> > despite that mapped state have a stale page->mapping pointer. I actually 
> > find that the more likely case, because otherwise the window is _so_ small 
> > that I don't see how you can hit the oops so reliably.
> 
> Ok, I tested with all 5 patches applied. It oopsed with the same
> trace, see below. Apart from one kernel/sched.c:3555 warning about the
> spinlock count overflowing, nothing else. :(

Ok, that preempt-count thing is a real problem, but should be unrelated to 
your issues.

Anyway, so this all means that we definitely have lost sight of an 
'anon_vma', even if page->mapping still points to it, and even though the 
page is still mapped.

I'll see if I can come up with a patch to do the same kind of validation 
on page->mapping as on the anon-vma chains themselves.

> I tried to see whether the page->mapping pointer is stale, I dunno,
> maybe there could be something in the register dump which could tell us
> what's happening.

Sadly, you cannot tell by the pointer. A stale pointer still is a 
perfectly fine kernel pointer, it's just that we've long since released 
the anon_vma it used to point to, and now it points to some random other 
data structure.

> So, it really looks like at least that list_head in anon_vma is
> bollocks, or even the whole anon_vma. So if this is correct, it is
> highly likely that the anon_vma is already freed material or not
> initialized at all.

Yes, it's pretty certain it is long free'd, and re-allocated to something 
else.

		Linus


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-12  0:13                                                                                                                                         ` Linus Torvalds
@ 2010-04-12  1:04                                                                                                                                           ` Linus Torvalds
  2010-04-12  7:20                                                                                                                                             ` Borislav Petkov
  0 siblings, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-12  1:04 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson



On Sun, 11 Apr 2010, Linus Torvalds wrote:
> 
> I'll see if I can come up with a patch to do the same kind of validation 
> on page->mapping as on the anon-vma chains themselves.

Ok, this may or may not work. It hasn't triggered for me, which may be 
because it's broken, but maybe it's because I'm not doing whatever it is 
you are doing to break our VM.

It checks each anonymous page at unmap time against the vma it gets 
unmapped from. It depends on the previous verify_vma debugging patch, and 
it would be interesting to hear whether this patch causes any new warnings 
for you.

If the warnings do happen, they are not going to be printing out any 
hugely informative data apart from the fact that the bad case happened at 
all. But if they do trigger, I can try to improve on them - it's just not 
worth trying to make them any more interesting if they never trigger.

		Linus

---
 mm/memory.c |   21 +++++++++++++++++++++
 mm/mmap.c   |    2 +-
 2 files changed, 22 insertions(+), 1 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 833952d..5d2df59 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -890,6 +890,25 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	return ret;
 }
 
+extern void verify_vma(struct vm_area_struct *);
+
+static void verify_anon_page(struct vm_area_struct *vma, struct page *page)
+{
+	struct anon_vma *anon_vma = vma->anon_vma;
+	struct anon_vma *need_anon_vma = page_anon_vma(page);
+	struct anon_vma_chain *avc;
+
+	verify_vma(vma);
+	if (WARN_ONCE(!anon_vma, "anonymous page in vma without anon_vma"))
+		return;
+	list_for_each_entry(avc, &vma->anon_vma_chain, same_vma) {
+		WARN_ONCE(avc->vma != vma, "anon_vma_chain vma entry doesn't match");
+		if (avc->anon_vma == need_anon_vma)
+			return;
+	}
+	WARN_ONCE(1, "page->mapping does not exist in vma chain");
+}
+
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pmd_t *pmd,
 				unsigned long addr, unsigned long end,
@@ -940,6 +959,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			tlb_remove_tlb_entry(tlb, pte, addr);
 			if (unlikely(!page))
 				continue;
+			if (PageAnon(page))
+				verify_anon_page(vma, page);
 			if (unlikely(details) && details->nonlinear_vma
 			    && linear_page_index(details->nonlinear_vma,
 						addr) != page->index)
diff --git a/mm/mmap.c b/mm/mmap.c
index 890c169..461f59c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1565,7 +1565,7 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 
 EXPORT_SYMBOL(get_unmapped_area);
 
-static void verify_vma(struct vm_area_struct *vma)
+void verify_vma(struct vm_area_struct *vma)
 {
 	if (vma->anon_vma) {
 		struct anon_vma_chain *avc;


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-12  1:04                                                                                                                                           ` Linus Torvalds
@ 2010-04-12  7:20                                                                                                                                             ` Borislav Petkov
  2010-04-12 16:02                                                                                                                                               ` Linus Torvalds
  0 siblings, 1 reply; 242+ messages in thread
From: Borislav Petkov @ 2010-04-12  7:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sun, Apr 11, 2010 at 06:04:39PM -0700

> It checks each anonymous page at unmap time against the vma it gets 
> unmapped from. It depends on the previous verify_vma debugging patch, and 
> it would be interesting to hear whether this patch causes any new warnings 
> for you.
> 
> If the warnings do happen, they are not going to be printing out any 
> hugely informative data apart from the fact that the bad case happened at 
> all. But if they do trigger, I can try to improve on them - it's just not 
> worth trying to make them any more interesting if they never trigger.

Haa, I think you're gonna want to improve them :)

	WARN_ONCE(1, "page->mapping does not exist in vma chain");

triggered on the first resume, producing four rather messy, interleaved
WARN_ONCE splats. Had I more cores, there might have been more of them :)
The output may need locking if a clean log is of interest (see below).

So, anyway, if I read this correctly, page->mapping points to an
anon_vma which is _not_ in the anon_vma chain of the vma (the list
linked through avc->same_vma).

And the spot we oops on is in page_referenced_anon():

	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {

which is actually where we iterate over all vmas associated with this
anon_vma.

So if the anon_vma pointed to by page->mapping was wrongly unlinked
at some point, no wonder we go boom on it later.

By the way, I completely understand when you say that your head hurts
from looking at this :).
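
A rough userspace sketch of what the triggering check boils down to, assuming a simplified singly linked chain instead of the kernel's list_head. All struct and function names here are hypothetical stand-ins, not the kernel's types: the warning fires when the anon_vma that the page points at cannot be found among the anon_vmas linked to the vma being unmapped.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct anon_vma. */
struct chain_anon_vma {
	int id;
};

/* Hypothetical stand-in for struct anon_vma_chain, linked per vma. */
struct chain_link {
	struct chain_anon_vma *anon_vma;
	struct chain_link *next_same_vma;
};

/* Hypothetical stand-in for struct vm_area_struct. */
struct chain_vma {
	struct chain_anon_vma *anon_vma;
	struct chain_link *chain;	/* head of this vma's chain */
};

/*
 * Walk the vma's chain looking for the anon_vma the page claims to
 * belong to. A false result corresponds to the
 * "page->mapping does not exist in vma chain" warning.
 */
static bool page_anon_vma_in_chain(const struct chain_vma *vma,
				   const struct chain_anon_vma *need)
{
	const struct chain_link *avc;

	for (avc = vma->chain; avc != NULL; avc = avc->next_same_vma) {
		if (avc->anon_vma == need)
			return true;
	}
	return false;
}
```

A page whose anon_vma was unlinked from every chain of the vma makes the walk come back empty, which is the situation reported above.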


[  486.580872] Restarting tasks ... done.
[  494.167242] [drm] Resetting GPU
[  495.422354] ------------[ cut here ]------------
[  495.422407] WARNING: at mm/memory.c:909 unmap_vmas+0x548/0xa29()
[  495.422442] Hardware name: System Product Name
[  495.422474] page->mapping does not exist in vma chain
[  495.422504] Modules linked in:
[  495.422545] ------------[ cut here ]------------
[  495.422555] ------------[ cut here ]------------
[  495.422565]  powernow_k8
[  495.422583] WARNING: at mm/memory.c:909 unmap_vmas+0x548/0xa29()
[  495.422591]  cpufreq_ondemand
[  495.422597] Hardware name: System Product Name
[  495.422602] page->mapping does not exist in vma chain cpufreq_powersave
[  495.422612] Modules linked in: cpufreq_userspace powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table freq_table cpufreq_conservative cpufreq_conservative binfmt_misc binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt kvm_amd dm_mod 8250_pnp kvm 8250 serial_core edac_core pcspkr k10temp ohci_hcd
[  495.422676]  ipv6Pid: 2919, comm: udevd Tainted: G        W  2.6.34-rc3-00506-g6c62fe4 #1
[  495.422689] Call Trace:
[  495.422694]  vfat
[  495.422700] ------------[ cut here ]------------
[  495.422721] WARNING: at mm/memory.c:909 unmap_vmas+0x548/0xa29()
[  495.422729]  fat [<ffffffff81038fe0>] warn_slowpath_common+0x7c/0x94
[  495.422746]  dm_crypt
[  495.422751] Hardware name: System Product Name
[  495.422758]  dm_modpage->mapping does not exist in vma chain
[  495.422767] Modules linked in: 8250_pnp [<ffffffff8103904f>] warn_slowpath_fmt+0x41/0x43
[  495.422784]  powernow_k8 cpufreq_ondemand 8250 cpufreq_powersave [<ffffffff810bcd20>] unmap_vmas+0x548/0xa29
[  495.422807]  serial_core cpufreq_userspace [<ffffffff810bd021>] ? unmap_vmas+0x849/0xa29
[  495.422828]  edac_core freq_table pcspkr cpufreq_conservative [<ffffffff810c17d8>] exit_mmap+0x102/0x1e4
[  495.422851]  binfmt_misc [<ffffffff810c173f>] ? exit_mmap+0x69/0x1e4
[  495.422863]  k10temp [<ffffffff810368bc>] mmput+0x48/0xb9
[  495.422876]  kvm_amd [<ffffffff8103ad90>] exit_mm+0x110/0x11d
[  495.422889]  ohci_hcd kvm
[  495.422903]  [<ffffffff8103c9e6>] do_exit+0x1c5/0x6e5
[  495.422909]  ipv6Pid: 2916, comm: udevd Tainted: G        W  2.6.34-rc3-00506-g6c62fe4 #1
[  495.422927]  [<ffffffff81065387>] ? trace_hardirqs_off_caller+0x1f/0xa9
[  495.422934] Call Trace:
[  495.422940]  vfat [<ffffffff8141016d>] ? retint_swapgs+0xe/0x13
[  495.422956]  fat [<ffffffff81038fe0>] warn_slowpath_common+0x7c/0x94
[  495.422972]  dm_crypt dm_mod 8250_pnp [<ffffffff8103cf8a>] do_group_exit+0x84/0xb0
[  495.422989]  8250 serial_core [<ffffffff8103cfcd>] sys_exit_group+0x17/0x1b
[  495.423013]  [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b
[  495.423019]  edac_core
[  495.423025] ---[ end trace d9664ac54d1edb0e ]---
[  495.423031]  pcspkr k10temp ohci_hcd
[  495.423043] Pid: 2914, comm: udevd Tainted: G        W  2.6.34-rc3-00506-g6c62fe4 #1
[  495.423055]  [<ffffffff8103904f>] warn_slowpath_fmt+0x41/0x43
[  495.423063] Call Trace:
[  495.423073]  [<ffffffff810bcd20>] unmap_vmas+0x548/0xa29
[  495.423087]  [<ffffffff81038fe0>] warn_slowpath_common+0x7c/0x94
[  495.423100]  [<ffffffff810bd021>] ? unmap_vmas+0x849/0xa29
[  495.423111]  [<ffffffff8103904f>] warn_slowpath_fmt+0x41/0x43
[  495.423123]  [<ffffffff810bcd20>] unmap_vmas+0x548/0xa29
[  495.423134]  [<ffffffff810bd021>] ? unmap_vmas+0x849/0xa29
[  495.423147]  [<ffffffff810c17d8>] exit_mmap+0x102/0x1e4
[  495.423159]  [<ffffffff810c17d8>] exit_mmap+0x102/0x1e4
[  495.423172]  [<ffffffff810c173f>] ? exit_mmap+0x69/0x1e4
[  495.423184]  [<ffffffff810c173f>] ? exit_mmap+0x69/0x1e4
[  495.423194]  [<ffffffff810368bc>] mmput+0x48/0xb9
[  495.423204]  [<ffffffff810368bc>] mmput+0x48/0xb9
[  495.423214]  [<ffffffff8103ad90>] exit_mm+0x110/0x11d
[  495.423225]  [<ffffffff8103ad90>] exit_mm+0x110/0x11d
[  495.423236]  [<ffffffff8103c9e6>] do_exit+0x1c5/0x6e5
[  495.423246]  [<ffffffff8103c9e6>] do_exit+0x1c5/0x6e5
[  495.423266]  [<ffffffff81065387>] ? trace_hardirqs_off_caller+0x1f/0xa9
[  495.423277]  [<ffffffff81065387>] ? trace_hardirqs_off_caller+0x1f/0xa9
[  495.423292]  [<ffffffff8141016d>] ? retint_swapgs+0xe/0x13
[  495.423303]  [<ffffffff8141016d>] ? retint_swapgs+0xe/0x13
[  495.423315]  [<ffffffff8103cf8a>] do_group_exit+0x84/0xb0
[  495.423325]  [<ffffffff8103cfcd>] sys_exit_group+0x17/0x1b
[  495.423334]  [<ffffffff8103cf8a>] do_group_exit+0x84/0xb0
[  495.423346]  [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b
[  495.423357]  [<ffffffff8103cfcd>] sys_exit_group+0x17/0x1b
[  495.423365] ---[ end trace d9664ac54d1edb0f ]---
[  495.423386]  [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b
[  495.423402] ---[ end trace d9664ac54d1edb10 ]---
[  495.424191] WARNING: at mm/memory.c:909 unmap_vmas+0x548/0xa29()
[  495.424215] Hardware name: System Product Name
[  495.424238] page->mapping does not exist in vma chain
[  495.424259] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 serial_core edac_core pcspkr k10temp ohci_hcd
[  495.424693] Pid: 1923, comm: udevd Tainted: G        W  2.6.34-rc3-00506-g6c62fe4 #1
[  495.424723] Call Trace:
[  495.424758]  [<ffffffff81038fe0>] warn_slowpath_common+0x7c/0x94
[  495.424788]  [<ffffffff8103904f>] warn_slowpath_fmt+0x41/0x43
[  495.424816]  [<ffffffff810bcd20>] unmap_vmas+0x548/0xa29
[  495.424843]  [<ffffffff810bd021>] ? unmap_vmas+0x849/0xa29
[  495.424875]  [<ffffffff810c17d8>] exit_mmap+0x102/0x1e4
[  495.424901]  [<ffffffff810c173f>] ? exit_mmap+0x69/0x1e4
[  495.424926]  [<ffffffff810368bc>] mmput+0x48/0xb9
[  495.424954]  [<ffffffff8103ad90>] exit_mm+0x110/0x11d
[  495.424981]  [<ffffffff8103c9e6>] do_exit+0x1c5/0x6e5
[  495.425008]  [<ffffffff81065387>] ? trace_hardirqs_off_caller+0x1f/0xa9
[  495.425038]  [<ffffffff8141016d>] ? retint_swapgs+0xe/0x13
[  495.425065]  [<ffffffff8103cf8a>] do_group_exit+0x84/0xb0
[  495.425091]  [<ffffffff8103cfcd>] sys_exit_group+0x17/0x1b
[  495.425119]  [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b
[  495.425156] ---[ end trace d9664ac54d1edb11 ]---


-- 
Regards/Gruss,
    Boris.


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 18:21                                                                                                             ` Linus Torvalds
                                                                                                                                 ` (2 preceding siblings ...)
  2010-04-10 19:36                                                                                                               ` Rik van Riel
@ 2010-04-12 14:40                                                                                                               ` Peter Zijlstra
  2010-04-12 15:17                                                                                                                 ` Minchan Kim
  2010-04-12 15:19                                                                                                                 ` Rik van Riel
  3 siblings, 2 replies; 242+ messages in thread
From: Peter Zijlstra @ 2010-04-12 14:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Rik van Riel,
	Andrew Morton, Minchan Kim, Linux Kernel Mailing List,
	Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins,
	sgunderson

On Sat, 2010-04-10 at 11:21 -0700, Linus Torvalds wrote:
> 

> Ho humm.
> 
> Maybe I'm crazy, but something started bothering me. And I started 
> wondering: when is the 'page->mapping' of an anonymous page actually 
> cleared?
> 
> The thing is, the mapping of an anonymous page is actually cleared only 
> when the page is _freed_, in "free_hot_cold_page()". 
> 
> Now, let's think about that. And in particular, let's think about how that 
> relates to the freeing of the 'anon_vma' that the page->mapping points to. 
> 
> The way the anon_vma is freed is when the mapping is torn down, and we do 
> roughly:
> 
> 	tlb = tlb_gather_mmu(mm,..)
> 	..
> 	unmap_vmas(&tlb, vma ..
> 	..
> 	free_pgtables()
> 	..
> 	tlb_finish_mmu(tlb, start, end);
> 
> and we actually unmap all the pages in "unmap_vmas()", and then _after_ 
> unmapping all the pages we do the "unlink_anon_vmas(vma);" in 
> "free_pgtables()". Fine so far - the anon_vma stays around until after the 
> page has been happily unmapped.
> 
> But "unmapped all the pages" is _not_ actually the same as "free'd all the 
> pages". The actual _freeing_ of the page happens generally in 
> tlb_finish_mmu(), because we can free the page only after we've flushed 
> any TLB entries.
> 
> So what we have in that tlb_gather structure is a list of _pending_ pages 
> to be freed, while we already actually free'd the anon_vmas earlier!
> 
> Now, the thing is, tlb_gather_mmu() begins a preempt-safe region (because 
> we use a per-cpu variable), but as far as I can tell it is _not_ an 
> RCU-safe region.
> 
> So I think we might actually get a real RCU freeing event while this all
> happens. So now the 'anon_vma' that 'page->mapping' points to has not just 
> been released back to the SLUB caches, the page itself might have been 
> released too.
> 
> I dunno. Does the above sound at all sane? Or am I just raving?
> 
> Something hacky like the above might fix it if I'm not just raving. I 
> really might be missing something here.

Right, so unless you have CONFIG_TREE_PREEMPT_RCU=y, the preempt-disable
== RCU read lock assumption does hold.

But even your patch doesn't close all the holes: while
zap_pte_range() can drop the last mapcount of the page,
tlb_remove_page() et al. need not drop the last use count of the
page.

Concurrent reclaim/gup/whatever could still have a count out on the page
delaying the actual free beyond the tlb gather RCU section.

So the reason page->mapping isn't cleared in page_remove_rmap() isn't
detailed beyond a (possible) race with page_add_anon_rmap() (which I
guess would be reclaim trying to unmap the page and a fault re-instating
it).

This also complicates the whole page_lock_anon_vma() thing, so it would
be nice to be able to remove this race and clear page->mapping in
page_remove_rmap().
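
To make the mapcount versus use count distinction concrete, here is a small userspace sketch. It assumes a toy page with two plain counters (the kernel's _mapcount is biased by -1, which is ignored here), and every name is hypothetical. The mapping pointer is reset only when the last reference goes away, as the thread describes for free_hot_cold_page().

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy page: counters and fields are illustrative, not the kernel layout. */
struct counted_page {
	int mapcount;	/* how many page tables still map the page */
	int refcount;	/* how many users still hold a reference */
	void *mapping;	/* stands in for page->mapping (the anon_vma) */
	bool freed;
};

/* Drop one reference; only the last one frees and clears the mapping. */
static void put_counted_page(struct counted_page *page)
{
	if (--page->refcount == 0) {
		page->mapping = NULL;
		page->freed = true;
	}
}

/* Unmapping drops the mapcount and the reference held by that mapping. */
static void unmap_counted_page(struct counted_page *page)
{
	page->mapcount--;
	put_counted_page(page);
}

/* Unmapped but not yet freed: mapping may point at a released anon_vma. */
static bool mapping_may_be_stale(const struct counted_page *page)
{
	return page->mapcount == 0 && !page->freed && page->mapping != NULL;
}
```

Starting from one mapping plus one extra pin (say, from gup), unmapping leaves the page alive with its mapping still set, and only the final put clears it.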


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-12 14:40                                                                                                               ` Peter Zijlstra
@ 2010-04-12 15:17                                                                                                                 ` Minchan Kim
  2010-04-12 15:33                                                                                                                   ` Peter Zijlstra
  2010-04-12 15:19                                                                                                                 ` Rik van Riel
  1 sibling, 1 reply; 242+ messages in thread
From: Minchan Kim @ 2010-04-12 15:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Borislav Petkov, Johannes Weiner,
	KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson

On Mon, Apr 12, 2010 at 11:40 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Sat, 2010-04-10 at 11:21 -0700, Linus Torvalds wrote:
>>
>
>> Ho humm.
>>
>> Maybe I'm crazy, but something started bothering me. And I started
>> wondering: when is the 'page->mapping' of an anonymous page actually
>> cleared?
>>
>> The thing is, the mapping of an anonymous page is actually cleared only
>> when the page is _freed_, in "free_hot_cold_page()".
>>
>> Now, let's think about that. And in particular, let's think about how that
>> relates to the freeing of the 'anon_vma' that the page->mapping points to.
>>
>> The way the anon_vma is freed is when the mapping is torn down, and we do
>> roughly:
>>
>>       tlb = tlb_gather_mmu(mm,..)
>>       ..
>>       unmap_vmas(&tlb, vma ..
>>       ..
>>       free_pgtables()
>>       ..
>>       tlb_finish_mmu(tlb, start, end);
>>
>> and we actually unmap all the pages in "unmap_vmas()", and then _after_
>> unmapping all the pages we do the "unlink_anon_vmas(vma);" in
>> "free_pgtables()". Fine so far - the anon_vma stays around until after the
>> page has been happily unmapped.
>>
>> But "unmapped all the pages" is _not_ actually the same as "free'd all the
>> pages". The actual _freeing_ of the page happens generally in
>> tlb_finish_mmu(), because we can free the page only after we've flushed
>> any TLB entries.
>>
>> So what we have in that tlb_gather structure is a list of _pending_ pages
>> to be freed, while we already actually free'd the anon_vmas earlier!
>>
>> Now, the thing is, tlb_gather_mmu() begins a preempt-safe region (because
>> we use a per-cpu variable), but as far as I can tell it is _not_ an
>> RCU-safe region.
>>
>> So I think we might actually get a real RCU freeing event while this all
>> happens. So now the 'anon_vma' that 'page->mapping' points to has not just
>> been released back to the SLUB caches, the page itself might have been
>> released too.
>>
>> I dunno. Does the above sound at all sane? Or am I just raving?
>>
>> Something hacky like the above might fix it if I'm not just raving. I
>> really might be missing something here.
>
> Right, so unless you have CONFIG_TREE_PREEMPT_RCU=y, the preempt-disable
> == RCU read lock assumption does hold.

Indeed.

>
> But even with your patch it doesn't close all holes because while
> zap_pte_range() can remove the last mapcount of the page, the
> tlb_remove_page() et al. don't need to be the last use count of the
> page.
>
> Concurrent reclaim/gup/whatever could still have a count out on the page
> delaying the actual free beyond the tlb gather RCU section.

The anon_vma lock is only valid while the page is mapped.
If reclaim/gup/whatever wants to use the anon_vma, it should check page_mapped() first.
And the last put_page() doesn't touch the anon_vma when freeing the page, so I
think it's not a problem. Am I missing something?

>
> This also complicates the whole page_lock_anon_vma() thing, so it would
> be nice to be able to remove this race and clear page->mapping in
> page_remove_rmap().
>

BTW, I totally agree with you.
Now anon_vma is very complicated.
SLAB_DESTROY_BY_RCU, vma merge, when page->mapping is cleared,
anon_vma_chain and so on.. :(


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-12 14:40                                                                                                               ` Peter Zijlstra
  2010-04-12 15:17                                                                                                                 ` Minchan Kim
@ 2010-04-12 15:19                                                                                                                 ` Rik van Riel
  2010-04-12 16:01                                                                                                                   ` Peter Zijlstra
  1 sibling, 1 reply; 242+ messages in thread
From: Rik van Riel @ 2010-04-12 15:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Borislav Petkov, Johannes Weiner,
	KOSAKI Motohiro, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson

On 04/12/2010 10:40 AM, Peter Zijlstra wrote:

> So the reason page->mapping isn't cleared in page_remove_rmap() isn't
> detailed beyond a (possible) race with page_add_anon_rmap() (which I
> guess would be reclaim trying to unmap the page and a fault re-instating
> it).
>
> This also complicates the whole page_lock_anon_vma() thing, so it would
> be nice to be able to remove this race and clear page->mapping in
> page_remove_rmap().

For anonymous pages, I don't see where the race comes from.

Both do_swap_page and the reclaim code hold the page lock
across the entire operation, so they are already excluding
each other.

Hugh, do you remember what the race between page_remove_rmap
and page_add_anon_rmap is/was all about?

I don't see a race in the current code...

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas  of a mergeable VMA
  2010-04-12 15:17                                                                                                                 ` Minchan Kim
@ 2010-04-12 15:33                                                                                                                   ` Peter Zijlstra
  0 siblings, 0 replies; 242+ messages in thread
From: Peter Zijlstra @ 2010-04-12 15:33 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Linus Torvalds, Borislav Petkov, Johannes Weiner,
	KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson

On Tue, 2010-04-13 at 00:17 +0900, Minchan Kim wrote:
> > Concurrent reclaim/gup/whatever could still have a count out on the page
> > delaying the actual free beyond the tlb gather RCU section.
> 
> The anon_vma lock is only valid while the page is mapped.
> If reclaim/gup/whatever wants to use the anon_vma, it should check page_mapped() first.
> And the last put_page() doesn't touch the anon_vma when freeing the page, so I
> think it's not a problem. Am I missing something?

Hmm, I think you're right. The race I was thinking of makes the
page_lock_anon_vma() RCU section overlap with that of the mmu_gather,
which ensures the thing is long enough, or hits the !_mapcount case.

I'm not sure there are other page->mapping users that are interesting.


^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-11 19:49                                                                                                                                       ` [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA Rik van Riel
@ 2010-04-12 15:44                                                                                                                                         ` Linus Torvalds
  2010-04-12 15:51                                                                                                                                           ` Rik van Riel
  0 siblings, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-12 15:44 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson



On Sun, 11 Apr 2010, Rik van Riel wrote:
> 
> Looking around the code some more, zap_pte_range()
> calls page_remove_rmap(), which leaves the
> page->mapping in place and has this comment:

See my earlier email about this exact issue. It's well-known that there 
are stale page->mapping pointers. The "page_mapped()" check _should_ have 
meant that in that case we never follow them, though.

> I wonder if we can clear page->mapping here, if
> list_is_singular(anon_vma->head).  That way we
> will not leave stale pointers behind.

What does that help? What if list _isn't_ singular?

		Linus

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-11 21:45                                                                                                                                       ` Rik van Riel
@ 2010-04-12 15:51                                                                                                                                         ` Linus Torvalds
  2010-04-13 10:36                                                                                                                                           ` KOSAKI Motohiro
  0 siblings, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-12 15:51 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson



On Sun, 11 Apr 2010, Rik van Riel wrote:
>
> Another thing I just thought of.
> 
> The anon_vma struct will not be reused for something completely
> different due to the SLAB_DESTROY_BY_RCU flag that the anon_vma_cachep
> is created with.

Rik, we _know_ it got re-used by something totally different. That's 
clearly the problem. The page->mapping pointer does _not_ point to an 
anon_vma any more. That's the problem here.

What we need to figure out is how we have a page on the LRU list that is 
still marked as 'mapped' that has that stale mapping pointer.

I can easily see how the stale mapping pointer happens for a non-mapped 
page. That part is trivial. Here's a simple case:

 - vmscan does that whole "isolate LRU pages", and one of them is a (at 
   that time mapped) anonymous page. It's now not on any LRU lists at all.

 - vmscan ends up waiting for pageout and/or writeback while holding that 
   list of pages.

 - in the meantime, the process that had the page exits or unmaps, 
   unmapping the page and freeing the vma and the anon_vma.

 - vmscan eventually gets to the page, and does that page_referenced() 
   dance. page->mapping points to something that is long long gone (as in 
   "IO access lifetimes", so we're talking something that has been freed 
   literally milliseconds ago, rather than any RCU delays)

So I can see the stale page->mapping pointer happening. That part is even 
trivial. What I don't see is how the page would be still marked 'mapped'. 
Everything that actually free's the vma/anon_vmas should also have 
unmapped the page before that - even if it didn't _free_ the page.
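
[ The rule described above, "only follow page->mapping while the page is 
  mapped", can be sketched as a standalone toy model. This is not kernel 
  code: every toy_* name is hypothetical, the constants are only modeled on 
  PAGE_MAPPING_ANON and PAGE_MAPPING_FLAGS, and locking and RCU are left out 
  entirely. ]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values, modeled on the kernel's low-bit tagging of page->mapping. */
#define TOY_MAPPING_ANON  1u	/* tag bit marking an anonymous mapping */
#define TOY_MAPPING_FLAGS 3u	/* mask covering all tag bits */

struct toy_anon_vma { int id; };

struct toy_page {
	uintptr_t mapping;	/* tagged pointer, like page->mapping */
	int mapcount;		/* -1 means unmapped, like page->_mapcount */
};

/* Store a tagged anon_vma pointer, as __page_set_anon_rmap() does. */
void toy_set_anon(struct toy_page *page, struct toy_anon_vma *av)
{
	page->mapping = (uintptr_t)av + TOY_MAPPING_ANON;
}

/*
 * Return the anon_vma only if the page is anonymous AND still mapped.
 * A stale pointer on an unmapped page is never followed.
 */
struct toy_anon_vma *toy_anon_vma_if_mapped(const struct toy_page *page)
{
	if ((page->mapping & TOY_MAPPING_FLAGS) != TOY_MAPPING_ANON)
		return NULL;	/* not an anonymous page */
	if (page->mapcount < 0)
		return NULL;	/* unmapped: the pointer may be stale */
	return (struct toy_anon_vma *)(page->mapping - TOY_MAPPING_ANON);
}
```

[ The model shows why the unmapped case is harmless, and why a stale pointer 
  on a page that still counts as mapped is the case that needs explaining. ]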

			Linus

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-12 15:44                                                                                                                                         ` Linus Torvalds
@ 2010-04-12 15:51                                                                                                                                           ` Rik van Riel
  0 siblings, 0 replies; 242+ messages in thread
From: Rik van Riel @ 2010-04-12 15:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On 04/12/2010 11:44 AM, Linus Torvalds wrote:
> On Sun, 11 Apr 2010, Rik van Riel wrote:
>>
>> Looking around the code some more, zap_pte_range()
>> calls page_remove_rmap(), which leaves the
>> page->mapping in place and has this comment:
>
> See my earlier email about this exact issue. It's well-known that there
> are stale page->mapping pointers. The "page_mapped()" check _should_ have
> meant that in that case we never follow them, though.

Good point.  I wonder if we have some SMP reordering
issue then?

>> I wonder if we can clear page->mapping here, if
>> list_is_singular(anon_vma->head).  That way we
>> will not leave stale pointers behind.
>
> What does that help? What if list _isn't_ singular?

Yeah, that was a bad idea.  Looking at the same code for
11 days straight seems to have put some knots in my brain :)

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-12 15:19                                                                                                                 ` Rik van Riel
@ 2010-04-12 16:01                                                                                                                   ` Peter Zijlstra
  2010-04-12 16:06                                                                                                                     ` Rik van Riel
  2010-04-13 10:53                                                                                                                     ` KOSAKI Motohiro
  0 siblings, 2 replies; 242+ messages in thread
From: Peter Zijlstra @ 2010-04-12 16:01 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Borislav Petkov, Johannes Weiner,
	KOSAKI Motohiro, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson

On Mon, 2010-04-12 at 11:19 -0400, Rik van Riel wrote:
> On 04/12/2010 10:40 AM, Peter Zijlstra wrote:
> 
> > So the reason page->mapping isn't cleared in page_remove_rmap() isn't
> > detailed beyond a (possible) race with page_add_anon_rmap() (which I
> > guess would be reclaim trying to unmap the page and a fault re-instating
> > it).
> >
> > This also complicates the whole page_lock_anon_vma() thing, so it would
> > be nice to be able to remove this race and clear page->mapping in
> > page_remove_rmap().
> 
> For anonymous pages, I don't see where the race comes from.
> 
> Both do_swap_page and the reclaim code hold the page lock
> across the entire operation, so they are already excluding
> each other.
> 
> Hugh, do you remember what the race between page_remove_rmap
> and page_add_anon_rmap is/was all about?
> 
> I don't see a race in the current code...


Something like the below would be nice if possible.


---
 mm/rmap.c |   44 +++++++++++++++++++++++++++++++-------------
 1 files changed, 31 insertions(+), 13 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index eaa7a09..241f75d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -286,7 +286,22 @@ void __init anon_vma_init(void)
 
 /*
  * Getting a lock on a stable anon_vma from a page off the LRU is
- * tricky: page_lock_anon_vma rely on RCU to guard against the races.
+ * tricky:
+ *
+ *  page_add_anon_rmap()
+ *   atomic_inc_and_test(&page->_mapcount);
+ *   page->mapping = anon_vma;
+ *
+ *
+ *  page_remove_rmap()
+ *   atomic_add_negative(-1, &page->_mapcount);
+ *   page->mapping = NULL;
+ *
+ * So we have to first read page->mapping, and then verify
+ * _mapcount, and make sure we order them correctly.
+ *
+ * We take anon_vma->lock in between so that if we see the anon_vma
+ * with a mapcount we know it won't go away on us.
  */
 struct anon_vma *page_lock_anon_vma(struct page *page)
 {
@@ -294,14 +309,24 @@ struct anon_vma *page_lock_anon_vma(struct page *page)
 	unsigned long anon_mapping;
 
 	rcu_read_lock();
-	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
+	anon_mapping = (unsigned long)rcu_dereference(page->mapping);
 	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
 		goto out;
-	if (!page_mapped(page))
-		goto out;
 
 	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
 	spin_lock(&anon_vma->lock);
+
+	/*
+	 * Order the reading of page->mapping and page->_mapcount against the
+	 * mb() implied by the atomic_add_negative() in page_remove_rmap().
+	 */
+	smp_rmb();
+	if (!page_mapped(page)) {
+		spin_unlock(&anon_vma->lock);
+		anon_vma = NULL;
+		goto out;
+	}
+
 	return anon_vma;
 out:
 	rcu_read_unlock();
@@ -864,15 +889,8 @@ void page_remove_rmap(struct page *page)
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
 		mem_cgroup_update_file_mapped(page, -1);
 	}
-	/*
-	 * It would be tidy to reset the PageAnon mapping here,
-	 * but that might overwrite a racing page_add_anon_rmap
-	 * which increments mapcount after us but sets mapping
-	 * before us: so leave the reset to free_hot_cold_page,
-	 * and remember that it's only reliable while mapped.
-	 * Leaving it set also helps swapoff to reinstate ptes
-	 * faster for those pages still in swapcache.
-	 */
+
+	page->mapping = NULL;
 }
 
 /*



^ permalink raw reply related	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-12  7:20                                                                                                                                             ` Borislav Petkov
@ 2010-04-12 16:02                                                                                                                                               ` Linus Torvalds
  2010-04-12 16:26                                                                                                                                                 ` Linus Torvalds
  0 siblings, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-12 16:02 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson



On Mon, 12 Apr 2010, Borislav Petkov wrote:
> > 
> > If the warnings do happen, they are not going to be printing out any 
> > hugely informative data apart from the fact that the bad case happened at 
> > all. But if they do trigger, I can try to improve on them - it's just not 
> > worth trying to make them any more interesting if they never trigger.
> 
> Haa, I think you're gonna want to improve them :)
> 
> 	WARN_ONCE(1, "page->mapping does not exist in vma chain");
> 
> triggered on the first resume showing a rather messy 4 WARN_ONCEs. Had I
> more cores, there maybe would've been more of them :) Maybe need locking
> if clean output is of interest (see below).

Goodie.

I can't trigger this on my machine (not that I tried very hard - but I did 
do some swapping loads etc by limiting my memory to just 1GB etc). So I'm 
pretty sure my verification code is "correct", and verifies things that 
should be right.

And the fact that it triggers under the exact load that you use to then 
trigger the bug is a damn good thing.  That means that we are finally on 
the right track, and we have something that correlates well with the 
actual bug.

> So, anyway, if I can read this correctly, there is a page->mapping
> anon_vma which is _not_ in the anon_vmas chain of the vma
> (avc->same_vma).

Yes, and that is supposed to be a no-no. The page is clearly associated 
with the vma in question (since we are unmapping it through that vma), but 
the vma list of 'anon_vma's doesn't actually have the one that 
'page->mapping' points to.

And that, in turn, means that we've lost sight of the 'page->mapping' 
anon_vma, and THAT in turn means that it could well have been free'd as 
being no longer referenced.

And if it was free'd, it could be re-allocated as something else (after 
the RCU grace period), and that directly explains your oops.

> By the way, I completely understand when you say that your head hurts
> from looking at this :).

Well, I have to say that I'm happy I've spent the time on it, because this 
way I got to learn all the new rules. It's just that I really wish I 
wouldn't have _had_ to.

Anyway, I'll have to think way more about this to see if I can come up 
with a debugging patch that shows more details about what actually caused 
this to happen in the first place. But we definitely have a smoking gun.

		Linus

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-12 16:01                                                                                                                   ` Peter Zijlstra
@ 2010-04-12 16:06                                                                                                                     ` Rik van Riel
  2010-04-12 16:46                                                                                                                       ` Linus Torvalds
  2010-04-13 10:53                                                                                                                     ` KOSAKI Motohiro
  1 sibling, 1 reply; 242+ messages in thread
From: Rik van Riel @ 2010-04-12 16:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Borislav Petkov, Johannes Weiner,
	KOSAKI Motohiro, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson

On 04/12/2010 12:01 PM, Peter Zijlstra wrote:

> @@ -864,15 +889,8 @@ void page_remove_rmap(struct page *page)
>   		__dec_zone_page_state(page, NR_FILE_MAPPED);
>   		mem_cgroup_update_file_mapped(page, -1);
>   	}
> -	/*
> -	 * It would be tidy to reset the PageAnon mapping here,
> -	 * but that might overwrite a racing page_add_anon_rmap
> -	 * which increments mapcount after us but sets mapping
> -	 * before us: so leave the reset to free_hot_cold_page,
> -	 * and remember that it's only reliable while mapped.
> -	 * Leaving it set also helps swapoff to reinstate ptes
> -	 * faster for those pages still in swapcache.
> -	 */
> +
> +	page->mapping = NULL;
>   }

That would be a bug for file pages :)

I could see how it could work for anonymous memory, though.

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-12 16:02                                                                                                                                               ` Linus Torvalds
@ 2010-04-12 16:26                                                                                                                                                 ` Linus Torvalds
  2010-04-12 18:40                                                                                                                                                   ` Rik van Riel
  2010-04-12 21:50                                                                                                                                                   ` [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA Borislav Petkov
  0 siblings, 2 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-12 16:26 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson



On Mon, 12 Apr 2010, Linus Torvalds wrote:
> 
> Yes, and that is supposed to be a no-no. The page is clearly associated 
> with the vma in question (since we are unmapping it through that vma), but 
> the vma list of 'anon_vma's doesn't actually have the one that 
> 'page->mapping' points to.
> 
> And that, in turn, means that we've lost sight of the 'page->mapping' 
> anon_vma, and THAT in turn means that it could well have been free'd as 
> being no longer referenced.
> 
> And if it was free'd, it could be re-allocated as something else (after 
> the RCU grace period), and that directly explains your oops.

I have a new theory. And this new theory is completely different from all 
the other things we've been looking at.

The new theory is really simple: 'page->mapping' has been re-set to the 
wrong mapping. 

Now, there is one case where we reset page->mapping _intentionally_, 
namely in the COW-breaking case of having the last user 
("page_move_anon_rmap"). And that looks fine, and happens under normal 
loads all the time. We _want_ to do it there.

But there is a _much_ more subtle case that involves swapping.

So guys, here's my fairly simple theory on what happens:

 - page gets allocated/mapped by process A. Let's call the anon_vma we 
   associate the page with 'A' to keep it easy to track.

 - Process A forks, creating process B. The anon_vma in B is 'B', and has 
   a chain that looks like 'B' -> 'A'. Everything is fine.

 - Swapping happens. The page (with mapping pointing to 'A') gets swapped 
   out (perhaps not to disk - it's enough to assume that it's just not 
   mapped any more, and lives entirely in the swap-cache)

 - Process B pages it in, which goes like this:

	do_swap_page ->
	  page = lookup_swap_cache(entry);
	 ...
	  set_pte_at(mm, address, page_table, pte);
	  page_add_anon_rmap(page, vma, address);

   And think about what happens here!

In particular, what happens is that this will now be the "first" mapping 
of that page, so page_add_anon_rmap() will do

	if (first)
		__page_set_anon_rmap(page, vma, address);

and notice what anon_vma it will use? It will use the anon_vma for process 
B!

So now page->mapping actually points to anon_vma 'B', not 'A' like it used 
to.

What happens then? Trivial: process 'A' also pages it in (nothing happens, 
it's not the first mapping), and then process 'B' execve's or exits or 
unmaps, making anon_vma B go away.

End result: process A has a page that points to anon_vma B, but anon_vma B 
does not exist any more. This can go on forever. Forget about RCU grace 
periods, forget about locking, forget anything like that. The bug is 
simply that page->mapping points to an anon_vma that was correct at one 
point, but was _not_ the one that was shared by all users of that possible 
mapping.
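
[ The sequence above can be replayed in a standalone toy model. This is a 
  sketch under simplifying assumptions, not kernel code: the toy_* names are 
  hypothetical, there is no locking, and "freeing" an anon_vma is modeled by 
  clearing a flag. ]

```c
#include <assert.h>
#include <stddef.h>

struct toy_anon_vma { int live; };	/* 1 while allocated, 0 once freed */

struct toy_page {
	struct toy_anon_vma *mapping;	/* models page->mapping */
	int mapcount;			/* -1 means unmapped, like page->_mapcount */
};

/* Models page_add_anon_rmap(): only the first mapper sets page->mapping. */
static void toy_add_anon_rmap(struct toy_page *page, struct toy_anon_vma *own)
{
	if (++page->mapcount == 0)
		page->mapping = own;
}

/* Models page_remove_rmap(): drops the count, leaves page->mapping alone. */
static void toy_remove_rmap(struct toy_page *page)
{
	assert(page->mapcount >= 0);
	page->mapcount--;
}

/* Returns 1 if a still-mapped page ends up pointing at a freed anon_vma. */
int toy_replay_swapin_race(void)
{
	struct toy_anon_vma a = { 1 }, b = { 1 };
	struct toy_page page = { NULL, -1 };

	toy_add_anon_rmap(&page, &a);	/* A maps the page: mapping = 'A' */
	toy_add_anon_rmap(&page, &b);	/* fork: B shares it, mapping stays 'A' */
	toy_remove_rmap(&page);		/* swap-out unmaps it from B ... */
	toy_remove_rmap(&page);		/* ... and from A; it lives in swap cache */
	toy_add_anon_rmap(&page, &b);	/* B swaps in first: mapping = 'B' */
	toy_add_anon_rmap(&page, &a);	/* A swaps in: not first, no change */
	toy_remove_rmap(&page);		/* B exits and unmaps */
	b.live = 0;			/* anon_vma 'B' goes away */

	return page.mapcount >= 0 && !page.mapping->live;
}
```

[ In this model toy_replay_swapin_race() reports the bad state: the page is 
  still mapped by A, but its mapping points at the dead 'B'. ]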

The patch below is my largely mindless try at fixing this. It's untested. 
I'm not entirely sure that it actually works. But it makes some amount of 
conceptual sense. No?

			Linus

---
 mm/rmap.c |   15 +++++++++++++--
 1 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index ee97d38..4bad326 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -734,9 +734,20 @@ void page_move_anon_rmap(struct page *page,
 static void __page_set_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address)
 {
-	struct anon_vma *anon_vma = vma->anon_vma;
+	struct anon_vma_chain *avc;
+	struct anon_vma *anon_vma;
+
+	BUG_ON(!vma->anon_vma);
+
+	/*
+	 * We must use the _oldest_ possible anon_vma for the page mapping!
+	 *
+	 * So take the last AVC chain entry in the vma, which is the deepest
+	 * ancestor, and use the anon_vma from that.
+	 */
+	avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
+	anon_vma = avc->anon_vma;
 
-	BUG_ON(!anon_vma);
 	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
 	page->mapping = (struct address_space *) anon_vma;
 	page->index = linear_page_index(vma, address);

^ permalink raw reply related	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-12 16:06                                                                                                                     ` Rik van Riel
@ 2010-04-12 16:46                                                                                                                       ` Linus Torvalds
  2010-04-12 18:40                                                                                                                         ` Peter Zijlstra
  0 siblings, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-12 16:46 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, Borislav Petkov, Johannes Weiner,
	KOSAKI Motohiro, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson



On Mon, 12 Apr 2010, Rik van Riel wrote:

> On 04/12/2010 12:01 PM, Peter Zijlstra wrote:
> 
> > @@ -864,15 +889,8 @@ void page_remove_rmap(struct page *page)
> >   		__dec_zone_page_state(page, NR_FILE_MAPPED);
> >   		mem_cgroup_update_file_mapped(page, -1);
> >   	}
> > -	/*
> > -	 * It would be tidy to reset the PageAnon mapping here,
> > -	 * but that might overwrite a racing page_add_anon_rmap
> > -	 * which increments mapcount after us but sets mapping
> > -	 * before us: so leave the reset to free_hot_cold_page,
> > -	 * and remember that it's only reliable while mapped.
> > -	 * Leaving it set also helps swapoff to reinstate ptes
> > -	 * faster for those pages still in swapcache.
> > -	 */
> > +
> > +	page->mapping = NULL;
> >   }
> 
> That would be a bug for file pages :)
> 
> I could see how it could work for anonymous memory, though.

I think it's scary for anonymous pages too. The _common_ case of 
page_remove_rmap() is from unmap/exit, which holds no locks on the page 
what-so-ever. So assuming the page could be reachable some other way (swap 
cache etc), I think the above is pretty scary. 

Also do note that the bug we've been chasing has _always_ had that test 
for "page_mapped(page)". See my other email about why the unmapped case 
isn't even interesting, because it's so easy to see how page->mapping can 
be stale for unmapped pages.

It's the _mapped_ case that is interesting, not the unmapped one. So 
setting page->mapping to NULL when unmapping is perhaps a nice consistency 
issue ("never have stale pointers"), but it's missing the fact that it's 
not really the case we care about.

			Linus

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-12 16:26                                                                                                                                                 ` Linus Torvalds
@ 2010-04-12 18:40                                                                                                                                                   ` Rik van Riel
  2010-04-12 19:00                                                                                                                                                     ` Borislav Petkov
  2010-04-12 21:50                                                                                                                                                   ` [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA Borislav Petkov
  1 sibling, 1 reply; 242+ messages in thread
From: Rik van Riel @ 2010-04-12 18:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On 04/12/2010 12:26 PM, Linus Torvalds wrote:

> But there is a _much_ more subtle case that involves swapping.
>
> So guys, here's my fairly simple theory on what happens:

That bug looks entirely possible.  Given that Borislav
has heavy swapping going on, it is quite possible that
this is the bug he has been triggering.

> The patch below is my largely mindless try at fixing this. It's untested.
> I'm not entirely sure that it actually works. But it makes some amount of
> conceptual sense. No?

The patch would help avoid the bug you described.

It does have the drawback of moving all the pages of
child processes back into the anon_vma of the parent
process after swapin, even if they are pages privately
owned by the child process.

I am guessing it may need a check to see whether the
page and swap slot are exclusively owned by the current
process.

Page or swap slot shared?      => oldest anon_vma
Page and swap slot exclusive?  => newest anon_vma

I suspect the easiest way to achieve this would be to
pass a flag in from do_swap_page, where we already
check this, a few lines above calling page_add_anon_rmap:

         if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
                 pte = maybe_mkwrite(pte_mkdirty(pte), vma);
                 flags &= ~FAULT_FLAG_WRITE;
         }
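
A rough standalone sketch of the shared/exclusive rule above, not kernel
code: the toy_* names and the "exclusive" parameter are hypothetical, and
the chain is modeled as a plain array with the vma's own (newest) anon_vma
first and the deepest ancestor (oldest) last.

```c
#include <assert.h>
#include <stddef.h>

struct toy_anon_vma { int id; };

/* Toy stand-in for a vma and its anon_vma chain. */
struct toy_vma {
	struct toy_anon_vma *chain[4];	/* [0] = newest, [depth - 1] = oldest */
	size_t depth;			/* number of valid chain entries */
};

/*
 * Page and swap slot exclusive? => newest anon_vma (the vma's own).
 * Page or swap slot shared?     => oldest anon_vma (the root ancestor).
 */
struct toy_anon_vma *toy_pick_anon_vma(const struct toy_vma *vma, int exclusive)
{
	if (vma->depth == 0)
		return NULL;	/* no anon_vma attached yet */
	return exclusive ? vma->chain[0] : vma->chain[vma->depth - 1];
}
```

With a child vma whose chain is 'B' -> 'A', an exclusive page would go to
'B' and a shared one to 'A', which is the split suggested above.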




^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-12 16:46                                                                                                                       ` Linus Torvalds
@ 2010-04-12 18:40                                                                                                                         ` Peter Zijlstra
  2010-04-12 19:30                                                                                                                           ` Peter Zijlstra
  0 siblings, 1 reply; 242+ messages in thread
From: Peter Zijlstra @ 2010-04-12 18:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Borislav Petkov, Johannes Weiner, KOSAKI Motohiro,
	Andrew Morton, Minchan Kim, Linux Kernel Mailing List,
	Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins,
	sgunderson

On Mon, 2010-04-12 at 09:46 -0700, Linus Torvalds wrote: 
> 
> On Mon, 12 Apr 2010, Rik van Riel wrote:
> 
> > On 04/12/2010 12:01 PM, Peter Zijlstra wrote:
> > 
> > > @@ -864,15 +889,8 @@ void page_remove_rmap(struct page *page)
> > >   		__dec_zone_page_state(page, NR_FILE_MAPPED);
> > >   		mem_cgroup_update_file_mapped(page, -1);
> > >   	}
> > > -	/*
> > > -	 * It would be tidy to reset the PageAnon mapping here,
> > > -	 * but that might overwrite a racing page_add_anon_rmap
> > > -	 * which increments mapcount after us but sets mapping
> > > -	 * before us: so leave the reset to free_hot_cold_page,
> > > -	 * and remember that it's only reliable while mapped.
> > > -	 * Leaving it set also helps swapoff to reinstate ptes
> > > -	 * faster for those pages still in swapcache.
> > > -	 */
> > > +
> > > +	page->mapping = NULL;
> > >   }
> > 
> > That would be a bug for file pages :)
> > 
> > I could see how it could work for anonymous memory, though.
> 
> I think it's scary for anonymous pages too. The _common_ case of 
> page_remove_rmap() is from unmap/exit, which holds no locks on the page 
> what-so-ever. So assuming the page could be reachable some other way (swap 
> cache etc), I think the above is pretty scary. 

Fully agreed.

> Also do note that the bug we've been chasing has _always_ had that test 
> for "page_mapped(page)". See my other email about why the unmapped case 
> isn't even interesting, because it's so easy to see how page->mapping can 
> be stale for unmapped pages.
> 
> It's the _mapped_ case that is interesting, not the unmapped one. So 
> setting page->mapping to NULL when unmapping is perhaps a nice consistency 
> issue ("never have stale pointers"), but it's missing the fact that it's 
> not really the case we care about.

Yes, I don't think this is the problem that has been plaguing us for
over a week now.

But while staring at that code it did get me worried that the current
code (page_lock_anon_vma):

- is missing the smp_read_barrier_depends() after the ACCESS_ONCE
- isn't properly ordered wrt page->mapping and page->_mapcount.
- doesn't appear to guarantee much at all when returning an anon_vma
  since it locks after checking page->_mapcount so:
    * it can return !NULL for an unmapped page (your patch cures that)
    * it can return !NULL but for a different anon_vma
      (my earlier patch checking page_rmapping() after the spin_lock
       cures that, but doesn't cure the above):

        [ highly unlikely but not impossible race ]

        page_referenced(page_A)

			try_to_unmap(page_A)

					unrelated fault

							fault page_A

	CPU0		CPU1		CPU2		CPU3

	rcu_read_lock()
	anon_vma = page->mapping;
	if (!(anon_vma & ANON_BIT))
	  goto out
	if (!page_mapped(page))
	  goto out

			page_remove_rmap()
			...
			anon_vma_free()-----\
					    v
					anon_vma_alloc()
					
							anon_vma_alloc()
							page_add_anon_rmap()
					   ^
	spin_lock(anon_vma->lock)----------/


    Now I don't think the above can happen, because of how our slab
    allocators work: they won't share a slab page between cpus like
    that. But once we make the whole thing preemptible, this race
    becomes a lot more likely.


So a page_lock_anon_vma() that looks a little like the below should
(I think) cure all our problems with it.


struct anon_vma *page_lock_anon_vma(struct page *page)
{
	struct anon_vma *anon_vma;
	unsigned long anon_mapping;

	rcu_read_lock();
again:
	anon_mapping = (unsigned long)rcu_dereference(page->mapping);
	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
		goto out;
	anon_vma = (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON);

	/*
	 * The RCU read lock ensures we can safely dereference anon_vma
	 * since it ensures the backing slab won't go away. It will however
	 * not guarantee it's the right object.
	 *
	 * First take the anon_vma->lock, this will, per anon_vma_unlink()
	 * avoid this anon_vma from being freed if it is a valid object.
	 */
	spin_lock(&anon_vma->lock);

	/*
	 * Secondly, we have to re-read page->mapping, so ensure it
	 * has not changed, rely on spin_lock() being at least a
	 * compiler barrier to force the re-read.
	 */
	if (unlikely(page_rmapping(page) != anon_vma)) {
		spin_unlock(&anon_vma->lock);
		goto again;
	}

	/*
	 * Ensure we read page->mapping before page->_mapcount,
	 * orders against atomic_add_negative() in page_remove_rmap().
	 */
	smp_rmb();

	/*
	 * Finally check that the page is still mapped,
	 * if not, this can't possibly be the right anon_vma.
	 */
	if (!page_mapped(page))
		goto unlock;

	return anon_vma;

unlock:
	spin_unlock(&anon_vma->lock);
out:
	rcu_read_unlock();
	return NULL;
}


With this, I think we can actually drop the RCU read lock when returning
since if this is indeed a valid anon_vma for this page, then the page is
still mapped, and hence the anon_vma was not deleted, and a possible
future delete will be held back by us holding the anon_vma->lock.

Now I could be totally wrong and have confused myself thoroughly, but
how does this look?
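
The lock-then-revalidate pattern in the function above can be tried out in
userspace. The following is only a sketch under stated assumptions: it uses
C11 atomics and a toy spinlock, the names obj_sketch, page_sketch and
page_lock_obj() are hypothetical, and it does not model RCU or slab reuse
at all (objects are assumed to stay allocated for the whole run).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct obj_sketch {
	atomic_flag lock;	/* toy spinlock, stands in for anon_vma->lock */
	int id;
};

struct page_sketch {
	_Atomic(struct obj_sketch *) mapping;
	atomic_int mapcount;	/* > 0 means "still mapped" */
};

static void obj_lock(struct obj_sketch *o)
{
	while (atomic_flag_test_and_set_explicit(&o->lock, memory_order_acquire))
		;	/* spin */
}

static void obj_unlock(struct obj_sketch *o)
{
	atomic_flag_clear_explicit(&o->lock, memory_order_release);
}

/* Returns the object with its lock held, or NULL if there is none. */
static struct obj_sketch *page_lock_obj(struct page_sketch *page)
{
	struct obj_sketch *o;

	assert(page != NULL);
	for (;;) {
		o = atomic_load_explicit(&page->mapping, memory_order_acquire);
		if (!o)
			return NULL;
		obj_lock(o);
		/* Re-read the pointer under the lock; retry if it moved. */
		if (atomic_load_explicit(&page->mapping, memory_order_acquire) == o)
			break;
		obj_unlock(o);
	}
	/* Only trust the object while the page is still mapped. */
	if (atomic_load_explicit(&page->mapcount, memory_order_acquire) <= 0) {
		obj_unlock(o);
		return NULL;
	}
	return o;
}
```

A caller that gets a non-NULL result owns the lock and must release it
with obj_unlock().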

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-12 18:40                                                                                                                                                   ` Rik van Riel
@ 2010-04-12 19:00                                                                                                                                                     ` Borislav Petkov
  2010-04-12 19:17                                                                                                                                                       ` Linus Torvalds
  0 siblings, 1 reply; 242+ messages in thread
From: Borislav Petkov @ 2010-04-12 19:00 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

From: Rik van Riel <riel@redhat.com>
Date: Mon, Apr 12, 2010 at 02:40:22PM -0400

> On 04/12/2010 12:26 PM, Linus Torvalds wrote:
> 
> >But there is a _much_ more subtle case that involved swapping.
> >
> >So guys, here's my fairly simple theory on what happens:
> 
> That bug looks entirely possible.  Given that Borislav
> has heavy swapping going on, it is quite possible that
> this is the bug he has been triggering.

Yeah, about that. I dunno whether you guys saw that but the machine has
8GB of RAM and shouldn't be swapping, AFAIK. The largest mem usage I
saw was 5GB used, most of which was pagecache. So I was kinda doubtful when
Linus came up with the swapping theory earlier. I'll pay attention to
the SwapCached in /proc/meminfo more to see whether we do any swapping.
It could be that there is a small amount which is swapped out for
whatever reason... Maybe that's the bug...

But I'll give the patch a run in an hour or so anyway.

-- 
Regards/Gruss,
Boris.

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-12 19:00                                                                                                                                                     ` Borislav Petkov
@ 2010-04-12 19:17                                                                                                                                                       ` Linus Torvalds
  2010-04-12 20:22                                                                                                                                                         ` [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare() Linus Torvalds
  0 siblings, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-12 19:17 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Rik van Riel, Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson



On Mon, 12 Apr 2010, Borislav Petkov wrote:
> 
> But I'll give the patch a run in an hour or so anyway.

Thanks. I suspect you will find that even if there is no actual disk IO 
swapping going on during any of the normal loads, the shrink_all_memory() 
thing in your hibernation event will cause swap to happen. Or at least 
swap-cache entries to be done.

Oh, and I've decided that my rcu_read_lock() patch for the tlb_gather() 
thing for unmapping is bogus. Exactly because the critical issue isn't 
when the page is free'd (and page->mapping is cleared), but when the page 
is unmapped (and page_mapped() clears).

And that is done correctly even with the delayed frees in tlb_gather. So 
adding the rcu_read_lock/rcu_read_unlock around it all doesn't actually 
matter or help.

So the patches that I think fix real bugs are

 - the anon_vma_prepare() fix to only share anon_vma's if they are 
   singletons.

 - the vma_adjust() fix to copy the right anon_vma chains

 - the anon_vma_clone() fix to traverse the avc's in reverse order, so 
   that the resulting cloned chain is the same as the original chain

   You got this patch as part of the "verify_vma()" patch, but the only 
   part of that patch that matters is the one-liner that changes a 
   "list_for_each_entry" to use the "_reverse()" version.

 - and that last patch to pick the right anon_vma when mapping a page 
   (which could still be improved: the "insert new page" case does _not_ 
   have to take the oldest anon_vma, and Rik is correct that if we have an 
   exclusive swap cache entry we could also take the top one)

I think I'll re-post all four patches with real commit messages, to get 
ack's for them. I'd like to finally get the much delayed -rc4 out the 
door.

Oh, and if that "pick the right anon_vma" patch doesn't fix it, I suspect 
we'll have to revert the whole anon_vma changes for 2.6.34. It's getting 
pretty late in the -rc series to fix this bug. I'm _hoping_ that I really 
nailed it this time, and that we're ok, but if Borislav reports it still 
happening, and people not having any other ideas, I think I'll just have 
to do an -rc4 with it all reverted, and then we can try again for 35 if 
somebody figures out the bug.

Hmm? I'd hate to revert it all now because of the hours I've put in 
looking at the code (to the point that I feel I understand it), but at the 
same time, if it was somebody else who was chasing this bug and not being 
able to fix it, I'd tell them "revert it, it's too late". Amount of effort 
spent doesn't matter if the bug still happens ;^(

		Linus

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-12 18:40                                                                                                                         ` Peter Zijlstra
@ 2010-04-12 19:30                                                                                                                           ` Peter Zijlstra
  2010-04-12 19:44                                                                                                                             ` Peter Zijlstra
  0 siblings, 1 reply; 242+ messages in thread
From: Peter Zijlstra @ 2010-04-12 19:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Borislav Petkov, Johannes Weiner, KOSAKI Motohiro,
	Andrew Morton, Minchan Kim, Linux Kernel Mailing List,
	Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins,
	sgunderson

On Mon, 2010-04-12 at 20:40 +0200, Peter Zijlstra wrote:

Hmm, if interleaved like so

> struct anon_vma *page_lock_anon_vma(struct page *page)
> {
>         struct anon_vma *anon_vma;
>         unsigned long anon_mapping;

page_remove_rmap()
anon_vma_unlink()
  anon_vma_free()

So that the below will all observe the old page->mapping:

>         rcu_read_lock();
> again:
>         anon_mapping = (unsigned long)rcu_dereference(page->mapping);
>         if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
>                 goto out;
>         anon_vma = (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON);
> 
>         /*
>          * The RCU read lock ensures we can safely dereference anon_vma
>          * since it ensures the backing slab won't go away. It will however
>          * not guarantee it's the right object.
>          *
>          * First take the anon_vma->lock, this will, per anon_vma_unlink()
>          * avoid this anon_vma from being freed if it is a valid object.
>          */
>         spin_lock(&anon_vma->lock);
> 
>         /*
>          * Secondly, we have to re-read page->mapping, so ensure it
>          * has not changed, rely on spin_lock() being at least a
>          * compiler barrier to force the re-read.
>          */
>         if (unlikely(page_rmapping(page) != anon_vma)) {
>                 spin_unlock(&anon_vma->lock);
>                 goto again;
>         }

page_add_anon_rmap(), so that the page_mapped() test below would be
positive,

>         /*
>          * Ensure we read page->mapping before page->_mapcount,
>          * orders against atomic_add_negative() in page_remove_rmap().
>          */
>         smp_rmb();
> 
>         /*
>          * Finally check that the page is still mapped,
>          * if not, this can't possibly be the right anon_vma.
>          */
>         if (!page_mapped(page))
>                 goto unlock;

We could here return a non-valid and already freed anon_vma.

>         return anon_vma;
> 
> unlock:
>         spin_unlock(&anon_vma->lock);
> out:
>         rcu_read_unlock();
>         return NULL;
> }
> 
> 


^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-12 19:30                                                                                                                           ` Peter Zijlstra
@ 2010-04-12 19:44                                                                                                                             ` Peter Zijlstra
  0 siblings, 0 replies; 242+ messages in thread
From: Peter Zijlstra @ 2010-04-12 19:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Borislav Petkov, Johannes Weiner, KOSAKI Motohiro,
	Andrew Morton, Minchan Kim, Linux Kernel Mailing List,
	Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins,
	sgunderson

On Mon, 2010-04-12 at 21:30 +0200, Peter Zijlstra wrote:
> 
> We could here return a non-valid and already freed anon_vma.
> 
OK, so none of the users of page_lock_anon_vma(), with the exception of
the memory-failure.c one, could really care. And all of them seem to be
safe enough wrt dealing with a dead one.

So unless people care, I'm going to not spend more time on trying to
make page_lock_anon_vma() behave.

Instead I'll try and see wth it is that migrate.c and rmap_walk_anon are
doing.


^ permalink raw reply	[flat|nested] 242+ messages in thread

* [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare()
  2010-04-12 19:17                                                                                                                                                       ` Linus Torvalds
@ 2010-04-12 20:22                                                                                                                                                         ` Linus Torvalds
  2010-04-12 20:23                                                                                                                                                           ` [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains Linus Torvalds
                                                                                                                                                                             ` (4 more replies)
  0 siblings, 5 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-12 20:22 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Rik van Riel, Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson


From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sat, 10 Apr 2010 10:36:19 -0700
Subject: [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare()

This changes the anon_vma reuse case to require that we only reuse
simple anon_vma's - ie the case when the vma only has a single anon_vma
associated with it.

This means that a reuse of an anon_vma from an adjacent vma will always
guarantee that both vma's are associated not only with the same
anon_vma, they will also have the same anon_vma chain (of just a single
entry in this case).

And since anon_vma re-use was the only case where the same anon_vma
might be associated with different chains of anon_vma's, we now have the
case that every vma that shares the same anon_vma will always also have the
same chain.  That makes it much easier to think about merging vma's that
share the same anon_vma's: you can always just drop the other anon_vma
chain in anon_vma_merge() since you know that they are always identical.

This also splits up the function to validate the anon_vma re-use, and
adds a lot of commentary about the possible races.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

Ok, so I'm sending out this series of four patches, in the (perhaps 
futile) hope that they will finally fix the problem that Borislav has been 
so great at reporting.

I'd like to gather ack's, nak's and perhaps changelog improvement 
suggestions while doing this.

 mm/mmap.c |   86 ++++++++++++++++++++++++++++++++++++++++++++-----------------
 1 files changed, 62 insertions(+), 24 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 75557c6..acb023e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -825,6 +825,61 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 }
 
 /*
+ * Rough compatibility check to quickly see if it's even worth looking
+ * at sharing an anon_vma.
+ *
+ * They need to have the same vm_file, and the flags can only differ
+ * in things that mprotect may change.
+ *
+ * NOTE! The fact that we share an anon_vma doesn't _have_ to mean that
+ * we can merge the two vma's. For example, we refuse to merge a vma if
+ * there is a vm_ops->close() function, because that indicates that the
+ * driver is doing some kind of reference counting. But that doesn't
+ * really matter for the anon_vma sharing case.
+ */
+static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *b)
+{
+	return a->vm_end == b->vm_start &&
+		mpol_equal(vma_policy(a), vma_policy(b)) &&
+		a->vm_file == b->vm_file &&
+		!((a->vm_flags ^ b->vm_flags) & ~(VM_READ|VM_WRITE|VM_EXEC)) &&
+		b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT);
+}
+
+/*
+ * Do some basic sanity checking to see if we can re-use the anon_vma
+ * from 'old'. The 'a'/'b' vma's are in VM order - one of them will be
+ * the same as 'old', the other will be the new one that is trying
+ * to share the anon_vma.
+ *
+ * NOTE! This runs with mm_sem held for reading, so it is possible that
+ * the anon_vma of 'old' is concurrently in the process of being set up
+ * by another page fault trying to merge _that_. But that's ok: if it
+ * is being set up, that automatically means that it will be a singleton
+ * acceptable for merging, so we can do all of this optimistically. But
+ * we do that ACCESS_ONCE() to make sure that we never re-load the pointer.
+ *
+ * IOW: that the "list_is_singular()" test on the anon_vma_chain only
+ * matters for the 'stable anon_vma' case (ie the thing we want to avoid
+ * is to return an anon_vma that is "complex" due to having gone through
+ * a fork).
+ *
+ * We also make sure that the two vma's are compatible (adjacent,
+ * and with the same memory policies). That's all stable, even with just
+ * a read lock on the mm_sem.
+ */
+static struct anon_vma *reusable_anon_vma(struct vm_area_struct *old, struct vm_area_struct *a, struct vm_area_struct *b)
+{
+	if (anon_vma_compatible(a, b)) {
+		struct anon_vma *anon_vma = ACCESS_ONCE(old->anon_vma);
+
+		if (anon_vma && list_is_singular(&old->anon_vma_chain))
+			return anon_vma;
+	}
+	return NULL;
+}
+
+/*
  * find_mergeable_anon_vma is used by anon_vma_prepare, to check
  * neighbouring vmas for a suitable anon_vma, before it goes off
  * to allocate a new anon_vma.  It checks because a repetitive
@@ -834,28 +889,16 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
  */
 struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
 {
+	struct anon_vma *anon_vma;
 	struct vm_area_struct *near;
-	unsigned long vm_flags;
 
 	near = vma->vm_next;
 	if (!near)
 		goto try_prev;
 
-	/*
-	 * Since only mprotect tries to remerge vmas, match flags
-	 * which might be mprotected into each other later on.
-	 * Neither mlock nor madvise tries to remerge at present,
-	 * so leave their flags as obstructing a merge.
-	 */
-	vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
-	vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC);
-
-	if (near->anon_vma && vma->vm_end == near->vm_start &&
- 			mpol_equal(vma_policy(vma), vma_policy(near)) &&
-			can_vma_merge_before(near, vm_flags,
-				NULL, vma->vm_file, vma->vm_pgoff +
-				((vma->vm_end - vma->vm_start) >> PAGE_SHIFT)))
-		return near->anon_vma;
+	anon_vma = reusable_anon_vma(near, vma, near);
+	if (anon_vma)
+		return anon_vma;
 try_prev:
 	/*
 	 * It is potentially slow to have to call find_vma_prev here.
@@ -868,14 +911,9 @@ try_prev:
 	if (!near)
 		goto none;
 
-	vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
-	vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC);
-
-	if (near->anon_vma && near->vm_end == vma->vm_start &&
-  			mpol_equal(vma_policy(near), vma_policy(vma)) &&
-			can_vma_merge_after(near, vm_flags,
-				NULL, vma->vm_file, vma->vm_pgoff))
-		return near->anon_vma;
+	anon_vma = reusable_anon_vma(near, near, vma);
+	if (anon_vma)
+		return anon_vma;
 none:
 	/*
 	 * There's no absolute need to look only at touching neighbours:
-- 
1.7.1.rc1.dirty
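
The compatibility test in anon_vma_compatible() can be exercised outside
the kernel. This is a minimal sketch, not the kernel code: vma_sketch, the
SK_* flag values and SK_PAGE_SHIFT are illustrative assumptions, and the
memory-policy comparison (mpol_equal) is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative values only; the kernel defines its own VM_* flags. */
#define SK_VM_READ	0x1UL
#define SK_VM_WRITE	0x2UL
#define SK_VM_EXEC	0x4UL
#define SK_VM_SHARED	0x8UL
#define SK_PAGE_SHIFT	12

/* Hypothetical cut-down stand-in for struct vm_area_struct. */
struct vma_sketch {
	unsigned long start, end, pgoff, flags;
	const void *file;
};

/*
 * 'a' must end exactly where 'b' starts, both must map the same file,
 * their flags may differ only in the bits mprotect can change, and the
 * page offsets must line up across the boundary.
 */
static bool vma_compatible(const struct vma_sketch *a,
			   const struct vma_sketch *b)
{
	assert(a != NULL && b != NULL);
	return a->end == b->start &&
		a->file == b->file &&
		!((a->flags ^ b->flags) &
		  ~(SK_VM_READ | SK_VM_WRITE | SK_VM_EXEC)) &&
		b->pgoff == a->pgoff + ((b->start - a->start) >> SK_PAGE_SHIFT);
}
```

The XOR isolates the flag bits that differ, and masking out the
read/write/exec bits leaves only differences that mprotect could not have
produced.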


^ permalink raw reply related	[flat|nested] 242+ messages in thread

* [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains
  2010-04-12 20:22                                                                                                                                                         ` [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare() Linus Torvalds
@ 2010-04-12 20:23                                                                                                                                                           ` Linus Torvalds
  2010-04-12 20:23                                                                                                                                                             ` [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order Linus Torvalds
                                                                                                                                                                               ` (3 more replies)
  2010-04-12 20:54                                                                                                                                                           ` [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare() Rik van Riel
                                                                                                                                                                             ` (3 subsequent siblings)
  4 siblings, 4 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-12 20:23 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Rik van Riel, Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson


From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sat, 10 Apr 2010 15:22:30 -0700
Subject: [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains

When we move the boundaries between two vma's due to things like
mprotect, we need to make sure that the anon_vma of the pages that got
moved from one vma to another gets properly copied around.  And that was
not always the case, in this rather hard-to-follow code sequence.

Clarify the code, and fix it so that it copies the anon_vma from the
right source.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/mmap.c |   24 ++++++++----------------
 1 files changed, 8 insertions(+), 16 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index acb023e..f90ea92 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -507,11 +507,12 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	struct address_space *mapping = NULL;
 	struct prio_tree_root *root = NULL;
 	struct file *file = vma->vm_file;
-	struct anon_vma *anon_vma = NULL;
 	long adjust_next = 0;
 	int remove_next = 0;
 
 	if (next && !insert) {
+		struct vm_area_struct *exporter = NULL;
+
 		if (end >= next->vm_end) {
 			/*
 			 * vma expands, overlapping all the next, and
@@ -519,7 +520,7 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
 			 */
 again:			remove_next = 1 + (end > next->vm_end);
 			end = next->vm_end;
-			anon_vma = next->anon_vma;
+			exporter = next;
 			importer = vma;
 		} else if (end > next->vm_start) {
 			/*
@@ -527,7 +528,7 @@ again:			remove_next = 1 + (end > next->vm_end);
 			 * mprotect case 5 shifting the boundary up.
 			 */
 			adjust_next = (end - next->vm_start) >> PAGE_SHIFT;
-			anon_vma = next->anon_vma;
+			exporter = next;
 			importer = vma;
 		} else if (end < vma->vm_end) {
 			/*
@@ -536,28 +537,19 @@ again:			remove_next = 1 + (end > next->vm_end);
 			 * mprotect case 4 shifting the boundary down.
 			 */
 			adjust_next = - ((vma->vm_end - end) >> PAGE_SHIFT);
-			anon_vma = next->anon_vma;
+			exporter = vma;
 			importer = next;
 		}
-	}
 
-	/*
-	 * When changing only vma->vm_end, we don't really need anon_vma lock.
-	 */
-	if (vma->anon_vma && (insert || importer || start != vma->vm_start))
-		anon_vma = vma->anon_vma;
-	if (anon_vma) {
 		/*
 		 * Easily overlooked: when mprotect shifts the boundary,
 		 * make sure the expanding vma has anon_vma set if the
 		 * shrinking vma had, to cover any anon pages imported.
 		 */
-		if (importer && !importer->anon_vma) {
-			/* Block reverse map lookups until things are set up. */
-			if (anon_vma_clone(importer, vma)) {
+		if (exporter && exporter->anon_vma && !importer->anon_vma) {
+			if (anon_vma_clone(importer, exporter))
 				return -ENOMEM;
-			}
-			importer->anon_vma = anon_vma;
+			importer->anon_vma = exporter->anon_vma;
 		}
 	}
 
-- 
1.7.1.rc1.dirty


^ permalink raw reply related	[flat|nested] 242+ messages in thread

* [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order
  2010-04-12 20:23                                                                                                                                                           ` [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains Linus Torvalds
@ 2010-04-12 20:23                                                                                                                                                             ` Linus Torvalds
  2010-04-12 20:23                                                                                                                                                               ` [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma Linus Torvalds
                                                                                                                                                                                 ` (3 more replies)
  2010-04-12 20:54                                                                                                                                                             ` [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains Rik van Riel
                                                                                                                                                                               ` (2 subsequent siblings)
  3 siblings, 4 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-12 20:23 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Rik van Riel, Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson


From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sun, 11 Apr 2010 17:15:03 -0700
Subject: [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order

We want to walk the chain in reverse order when cloning it, so that the
order of the result chain will be the same as the order in the source
chain.  When we add entries to the chain, they go at the head of the
chain, so we want to add the source head last.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/rmap.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index eaa7a09..ee97d38 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -182,7 +182,7 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 {
 	struct anon_vma_chain *avc, *pavc;
 
-	list_for_each_entry(pavc, &src->anon_vma_chain, same_vma) {
+	list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
 		avc = anon_vma_chain_alloc();
 		if (!avc)
 			goto enomem_failure;
-- 
1.7.1.rc1.dirty
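
Why the walk direction matters can be seen with a toy chain. This is a
sketch under assumptions, not the kernel list API: chain_sketch is a
hypothetical fixed-size array whose element 0 is the head, and
chain_add_head() mimics inserting at the head of the list.

```c
#include <assert.h>
#include <stddef.h>

#define SK_MAX 8

/* Toy chain: items[0] is the head; new entries go in at the head. */
struct chain_sketch {
	int items[SK_MAX];
	size_t len;
};

static void chain_add_head(struct chain_sketch *c, int v)
{
	assert(c->len < SK_MAX);
	for (size_t i = c->len; i > 0; i--)
		c->items[i] = c->items[i - 1];
	c->items[0] = v;
	c->len++;
}

/* Walking the source head-to-tail reverses the order in the copy. */
static void clone_forward(struct chain_sketch *dst,
			  const struct chain_sketch *src)
{
	dst->len = 0;
	for (size_t i = 0; i < src->len; i++)
		chain_add_head(dst, src->items[i]);
}

/* Walking the source tail-to-head adds the source head last. */
static void clone_reverse(struct chain_sketch *dst,
			  const struct chain_sketch *src)
{
	dst->len = 0;
	for (size_t i = src->len; i > 0; i--)
		chain_add_head(dst, src->items[i - 1]);
}
```

For a source chain of 3, 2, 1 the forward walk yields 1, 2, 3 while the
reverse walk yields 3, 2, 1, matching the source.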


^ permalink raw reply related	[flat|nested] 242+ messages in thread

* [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma
  2010-04-12 20:23                                                                                                                                                             ` [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order Linus Torvalds
@ 2010-04-12 20:23                                                                                                                                                               ` Linus Torvalds
  2010-04-12 21:03                                                                                                                                                                 ` Rik van Riel
  2010-04-13  0:41                                                                                                                                                                 ` Johannes Weiner
  2010-04-12 20:57                                                                                                                                                               ` [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order Rik van Riel
                                                                                                                                                                                 ` (2 subsequent siblings)
  3 siblings, 2 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-12 20:23 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Rik van Riel, Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson


From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 12 Apr 2010 12:44:29 -0700
Subject: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma

Otherwise we might be mapping in a page in a new mapping, but that page
(through the swapcache) would later be mapped into an old mapping too.
The page->mapping must be the case that works for everybody, not just
the mapping that happened to page it in first.

This can be improved in certain cases: if we know the page is private to
just this particular mapping (for example, it's a new page, or it is the
only swapcache entry), we could pick the top (most specific) anon_vma.

But that's a future optimization. Make it _work_ reliably first.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/rmap.c |   15 +++++++++++++--
 1 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index ee97d38..4bad326 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -734,9 +734,20 @@ void page_move_anon_rmap(struct page *page,
 static void __page_set_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address)
 {
-	struct anon_vma *anon_vma = vma->anon_vma;
+	struct anon_vma_chain *avc;
+	struct anon_vma *anon_vma;
+
+	BUG_ON(!vma->anon_vma);
+
+	/*
+	 * We must use the _oldest_ possible anon_vma for the page mapping!
+	 *
+	 * So take the last AVC chain entry in the vma, which is the deepest
+	 * ancestor, and use the anon_vma from that.
+	 */
+	avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
+	anon_vma = avc->anon_vma;
 
-	BUG_ON(!anon_vma);
 	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
 	page->mapping = (struct address_space *) anon_vma;
 	page->index = linear_page_index(vma, address);
-- 
1.7.1.rc1.dirty
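
A minimal userspace sketch of the lookup this patch performs, assuming that entries are always added at the head of the chain, so the tail (head.prev) is the first entry ever added. The list helpers below are simplified stand-ins written for illustration, not the kernel's list.h, and the struct layouts are reduced to the fields the example needs.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel list helpers (illustration only). */
struct list_head { struct list_head *next, *prev; };

#define list_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void list_init(struct list_head *h) { h->next = h->prev = h; }

/* Insert right after the head, so the newest entry is always first. */
static void list_add_head(struct list_head *new, struct list_head *head)
{
	new->next = head->next;
	new->prev = head;
	head->next->prev = new;
	head->next = new;
}

struct anon_vma { int id; };

struct anon_vma_chain {
	struct anon_vma *anon_vma;
	struct list_head same_vma;
};

/* Hypothetical reduced vma: only the chain head is modeled. */
struct vma { struct list_head anon_vma_chain; };

/* The tail entry (head.prev) was added first, so it is the oldest one. */
static struct anon_vma *oldest_anon_vma(struct vma *vma)
{
	struct anon_vma_chain *avc;

	/* The chain must not be empty, or head.prev is the head itself. */
	assert(vma->anon_vma_chain.next != &vma->anon_vma_chain);
	avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
	return avc->anon_vma;
}
```

Linking an entry for 'A' and then one for 'B' into the same chain makes 'B' the head and 'A' the tail, so oldest_anon_vma() returns 'A'.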



* Re: [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare()
  2010-04-12 20:22                                                                                                                                                         ` [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare() Linus Torvalds
  2010-04-12 20:23                                                                                                                                                           ` [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains Linus Torvalds
@ 2010-04-12 20:54                                                                                                                                                           ` Rik van Riel
  2010-04-12 23:54                                                                                                                                                           ` Johannes Weiner
                                                                                                                                                                             ` (2 subsequent siblings)
  4 siblings, 0 replies; 242+ messages in thread
From: Rik van Riel @ 2010-04-12 20:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On 04/12/2010 04:22 PM, Linus Torvalds wrote:
>
> From: Linus Torvalds<torvalds@linux-foundation.org>
> Date: Sat, 10 Apr 2010 10:36:19 -0700
> Subject: [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare()

> Signed-off-by: Linus Torvalds<torvalds@linux-foundation.org>

Reviewed-by: Rik van Riel <riel@redhat.com>


* Re: [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains
  2010-04-12 20:23                                                                                                                                                           ` [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains Linus Torvalds
  2010-04-12 20:23                                                                                                                                                             ` [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order Linus Torvalds
@ 2010-04-12 20:54                                                                                                                                                             ` Rik van Riel
  2010-04-12 23:59                                                                                                                                                             ` Johannes Weiner
  2010-04-13  4:15                                                                                                                                                             ` Minchan Kim
  3 siblings, 0 replies; 242+ messages in thread
From: Rik van Riel @ 2010-04-12 20:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On 04/12/2010 04:23 PM, Linus Torvalds wrote:
>
> From: Linus Torvalds<torvalds@linux-foundation.org>
> Date: Sat, 10 Apr 2010 15:22:30 -0700
> Subject: [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains
>
> When we move the boundaries between two vma's due to things like
> mprotect, we need to make sure that the anon_vma of the pages that got
> moved from one vma to another gets properly copied around.  And that was
> not always the case, in this rather hard-to-follow code sequence.
>
> Clarify the code, and fix it so that it copies the anon_vma from the
> right source.
>
> Signed-off-by: Linus Torvalds<torvalds@linux-foundation.org>

Reviewed-by: Rik van Riel <riel@redhat.com>



* Re: [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order
  2010-04-12 20:23                                                                                                                                                             ` [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order Linus Torvalds
  2010-04-12 20:23                                                                                                                                                               ` [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma Linus Torvalds
@ 2010-04-12 20:57                                                                                                                                                               ` Rik van Riel
  2010-04-13  0:18                                                                                                                                                               ` Johannes Weiner
  2010-04-13  4:16                                                                                                                                                               ` Minchan Kim
  3 siblings, 0 replies; 242+ messages in thread
From: Rik van Riel @ 2010-04-12 20:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On 04/12/2010 04:23 PM, Linus Torvalds wrote:
>
> From: Linus Torvalds<torvalds@linux-foundation.org>
> Date: Sun, 11 Apr 2010 17:15:03 -0700
> Subject: [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order
>
> We want to walk the chain in reverse order when cloning it, so that the
> order of the result chain will be the same as the order in the source
> chain.  When we add entries to the chain, they go at the head of the
> chain, so we want to add the source head last.
>
> Signed-off-by: Linus Torvalds<torvalds@linux-foundation.org>

Reviewed-by: Rik van Riel <riel@redhat.com>
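
The ordering argument in the quoted changelog can be checked with a small model. This is only a sketch: the chain is modeled as an array with index 0 as the head, and the names chain_add_head, clone_forward and clone_reverse are made up for the example, not taken from the kernel.

```c
#include <assert.h>
#include <string.h>

#define MAX_CHAIN 8

/* A chain modeled as an array; index 0 is the head. */
struct chain { int ids[MAX_CHAIN]; int len; };

/* New entries go at the head, shifting existing ones toward the tail. */
static void chain_add_head(struct chain *c, int id)
{
	assert(c->len < MAX_CHAIN);
	memmove(&c->ids[1], &c->ids[0], c->len * sizeof(c->ids[0]));
	c->ids[0] = id;
	c->len++;
}

/* Walking the source head-to-tail yields a reversed copy. */
static void clone_forward(struct chain *dst, const struct chain *src)
{
	dst->len = 0;
	for (int i = 0; i < src->len; i++)
		chain_add_head(dst, src->ids[i]);
}

/* Walking tail-to-head adds the source head last, preserving the order. */
static void clone_reverse(struct chain *dst, const struct chain *src)
{
	dst->len = 0;
	for (int i = src->len - 1; i >= 0; i--)
		chain_add_head(dst, src->ids[i]);
}
```

With a source chain of 3, 2, 1 (head first), clone_forward() produces 1, 2, 3 while clone_reverse() produces 3, 2, 1, which is the property the patch relies on.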



* Re: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma
  2010-04-12 20:23                                                                                                                                                               ` [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma Linus Torvalds
@ 2010-04-12 21:03                                                                                                                                                                 ` Rik van Riel
  2010-04-13  0:41                                                                                                                                                                 ` Johannes Weiner
  1 sibling, 0 replies; 242+ messages in thread
From: Rik van Riel @ 2010-04-12 21:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On 04/12/2010 04:23 PM, Linus Torvalds wrote:
>
> From: Linus Torvalds<torvalds@linux-foundation.org>
> Date: Mon, 12 Apr 2010 12:44:29 -0700
> Subject: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma
>
> Otherwise we might be mapping in a page in a new mapping, but that page
> (through the swapcache) would later be mapped into an old mapping too.
> The page->mapping must be the case that works for everybody, not just
> the mapping that happened to page it in first.
>
> This can be improved in certain cases: if we know the page is private to
> just this particular mapping (for example, it's a new page, or it is the
> only swapcache entry), we could pick the top (most specific) anon_vma.
>
> But that's a future optimization. Make it _work_ reliably first.

Agreed.  I'll send an incremental for that later, you
can judge whether or not it's something you'll want to
merge before or after 2.6.34.

> Signed-off-by: Linus Torvalds<torvalds@linux-foundation.org>

Reviewed-by: Rik van Riel <riel@redhat.com>


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-12 16:26                                                                                                                                                 ` Linus Torvalds
  2010-04-12 18:40                                                                                                                                                   ` Rik van Riel
@ 2010-04-12 21:50                                                                                                                                                   ` Borislav Petkov
  2010-04-12 22:11                                                                                                                                                     ` Linus Torvalds
  1 sibling, 1 reply; 242+ messages in thread
From: Borislav Petkov @ 2010-04-12 21:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, Apr 12, 2010 at 09:26:57AM -0700

> I have a new theory. And this new theory is completely different from all 
> the other things we've been looking at.

Yeah, because it all starts with "I have a new theory..." :o)

> The patch below is my largely mindless try at fixing this. It's untested. 
> I'm not entirely sure that it actually works. But it makes some amount of 
> conceptual sense. No?

Linus, are you trying to give me a heart-attack? This sh*t just survived
20(!) hibernation runs without a problem (well, there is this nagging
/sysfs lockdep warning) but apart from that, it survived! I even did my
all time best when hitting on it. Normally, it used to crap up on the
6th cycle at the latest. Now we're rock solid. And yes, there was something
like ~64MB in the swap cache.

Also, I have your verification stuff in addition to the 4 patches you
sent before. Not a single WARN_ONCE got triggered. So I have a gut
feeling that it is fixed but you never know with these beasts.

As before, I'll rebuild and reapply everything in the morning and retest
just in case. And I guess I'll have to test all following -rc's so that
we can be absolutely sure.

So cheers!

-- 
Regards/Gruss,
    Boris.


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-12 21:50                                                                                                                                                   ` [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA Borislav Petkov
@ 2010-04-12 22:11                                                                                                                                                     ` Linus Torvalds
  2010-04-12 22:18                                                                                                                                                       ` Linus Torvalds
  2010-04-13  9:38                                                                                                                                                       ` Borislav Petkov
  0 siblings, 2 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-12 22:11 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson



On Mon, 12 Apr 2010, Borislav Petkov wrote:
> 
> > I have a new theory. And this new theory is completely different from all 
> > the other things we've been looking at.
> 
> Yeah, because it all starts with "I have a new theory..." :o)

Hey, all my other theories made sense too.. They just didn't work.

But as Edison said: I didn't fail, I just found three other ways to not 
fix your bug.

> > The patch below is my largely mindless try at fixing this. It's untested. 
> > I'm not entirely sure that it actually works. But it makes some amount of 
> > conceptual sense. No?
> 
> Linus, are you trying to give me a heart-attack? This sh*t just survived
> 20(!) hibernation runs without a problem (well, there is this nagging
> /sysfs lockdep warning) but apart from that, it survived! I even did my
> all time best when hitting on it. Normally, it used to crap up on the
> 6th cycle at the latest. Now we're rock solid. And yes, there was something
> like ~64MB in the swap cache.
> 
> Also, I have your verification stuff in addition to the 4 patches you
> sent before. Not a single WARN_ONCE got triggered. So I have a gut
> feeling that it is fixed but you never know with these beasts.

Ok. That does sound very positive. Of course, last time you sounded 
positive, I had an email from you half an hour later that said "oh no, it 
oopsed again". So I'll take it with a bit of salt, but on the whole I'll 
be optimistic about it.

				Linus


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-12 22:11                                                                                                                                                     ` Linus Torvalds
@ 2010-04-12 22:18                                                                                                                                                       ` Linus Torvalds
  2010-04-12 22:29                                                                                                                                                         ` Borislav Petkov
  2010-04-13  9:38                                                                                                                                                       ` Borislav Petkov
  1 sibling, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-12 22:18 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson



Oh, btw, I like your email gateway. Only noticed now:

	mail.skyhub.de (SuperMail on ZX Spectrum 128k)

that's a tough little machine.

			Linus


* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-12 22:18                                                                                                                                                       ` Linus Torvalds
@ 2010-04-12 22:29                                                                                                                                                         ` Borislav Petkov
  0 siblings, 0 replies; 242+ messages in thread
From: Borislav Petkov @ 2010-04-12 22:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, Apr 12, 2010 at 03:18:20PM -0700

> Oh, btw, I like your email gateway. Only noticed now:
> 
> 	mail.skyhub.de (SuperMail on ZX Spectrum 128k)
> 
> that's a tough little machine.

Yeah, and it can handle all that mail traffic just fine :)

-- 
Regards/Gruss,
    Boris.


* Re: [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare()
  2010-04-12 20:22                                                                                                                                                         ` [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare() Linus Torvalds
  2010-04-12 20:23                                                                                                                                                           ` [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains Linus Torvalds
  2010-04-12 20:54                                                                                                                                                           ` [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare() Rik van Riel
@ 2010-04-12 23:54                                                                                                                                                           ` Johannes Weiner
  2010-04-13  4:04                                                                                                                                                           ` Minchan Kim
  2010-04-13  9:51                                                                                                                                                           ` Peter Zijlstra
  4 siblings, 0 replies; 242+ messages in thread
From: Johannes Weiner @ 2010-04-12 23:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Rik van Riel, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

Hi Linus,

On Mon, Apr 12, 2010 at 01:22:33PM -0700, Linus Torvalds wrote:
> 
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Sat, 10 Apr 2010 10:36:19 -0700
> Subject: [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare()
> 
> This changes the anon_vma reuse case to require that we only reuse
> simple anon_vma's - ie the case when the vma only has a single anon_vma
> associated with it.
> 
> This means that a reuse of an anon_vma from an adjacent vma will always
> guarantee that both vma's are associated not only with the same
> anon_vma, they will also have the same anon_vma chain (of just a single
> entry in this case).
> 
> And since anon_vma re-use was the only case where the same anon_vma
> might be associated with different chains of anon_vma's, we now have the
> case that every vma that shares the same vma will always also have the

                                           ^^^ That should be anon_vma?

> same chain.  That makes it much easier to think about merging vma's that
> share the same anon_vma's: you can always just drop the other anon_vma
> chain in anon_vma_merge() since you know that they are always identical.

I like to think of 'incomplete' and 'complete' versions of the same
chain and that this new rule of yours simplifies things by limiting
reuse to the cases where the incomplete and the complete version
end up identical.  I can live with your wording, though :)

> This also splits up the function to validate the anon_vma re-use, and
> adds a lot of commentary about the possible races.
> 
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

That said, I still don't like that the vma comparisons differ depending
on whether we reuse an anon_vma or merge vmas.  In my happy-place, the
same vma comparison function is the predicate for both cases, so I actually
liked that aspect of the old code, but I also see that code reuse is a
PITA in that file...  Ah well, that can still be cleaned up later.
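
The reuse rule described in the quoted changelog (only reuse an anon_vma from a neighbor whose chain holds a single entry) can be sketched as follows. The struct layout and the name simple_anon_vma_or_null are hypothetical; the chain is reduced to a length counter purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct anon_vma { int id; };

/* Hypothetical reduced vma: the chain is modeled by its length only. */
struct vma {
	struct anon_vma *anon_vma;  /* the vma's own anon_vma, or NULL */
	int chain_len;              /* number of anon_vmas on its chain */
};

/*
 * Return the neighbor's anon_vma only when its chain has exactly one
 * entry, so that sharing the anon_vma also means sharing an identical
 * chain. Otherwise return NULL and let the caller allocate a new one.
 */
static struct anon_vma *simple_anon_vma_or_null(const struct vma *neighbor)
{
	if (neighbor->anon_vma && neighbor->chain_len == 1)
		return neighbor->anon_vma;
	return NULL;
}
```

A neighbor that went through fork (chain length 2 or more) is rejected under this rule, which is what makes the chains of any two vmas sharing an anon_vma identical.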


* Re: [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains
  2010-04-12 20:23                                                                                                                                                           ` [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains Linus Torvalds
  2010-04-12 20:23                                                                                                                                                             ` [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order Linus Torvalds
  2010-04-12 20:54                                                                                                                                                             ` [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains Rik van Riel
@ 2010-04-12 23:59                                                                                                                                                             ` Johannes Weiner
  2010-04-13  4:15                                                                                                                                                             ` Minchan Kim
  3 siblings, 0 replies; 242+ messages in thread
From: Johannes Weiner @ 2010-04-12 23:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Rik van Riel, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On Mon, Apr 12, 2010 at 01:23:04PM -0700, Linus Torvalds wrote:
> 
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Sat, 10 Apr 2010 15:22:30 -0700
> Subject: [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains
> 
> When we move the boundaries between two vma's due to things like
> mprotect, we need to make sure that the anon_vma of the pages that got
> moved from one vma to another gets properly copied around.  And that was
> not always the case, in this rather hard-to-follow code sequence.
> 
> Clarify the code, and fix it so that it copies the anon_vma from the
> right source.
> 
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>



* Re: [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order
  2010-04-12 20:23                                                                                                                                                             ` [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order Linus Torvalds
  2010-04-12 20:23                                                                                                                                                               ` [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma Linus Torvalds
  2010-04-12 20:57                                                                                                                                                               ` [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order Rik van Riel
@ 2010-04-13  0:18                                                                                                                                                               ` Johannes Weiner
  2010-04-13  4:16                                                                                                                                                               ` Minchan Kim
  3 siblings, 0 replies; 242+ messages in thread
From: Johannes Weiner @ 2010-04-13  0:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Rik van Riel, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On Mon, Apr 12, 2010 at 01:23:24PM -0700, Linus Torvalds wrote:
> 
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Sun, 11 Apr 2010 17:15:03 -0700
> Subject: [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order
> 
> We want to walk the chain in reverse order when cloning it, so that the
> order of the result chain will be the same as the order in the source
> chain.  When we add entries to the chain, they go at the head of the
> chain, so we want to add the source head last.
> 
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma
  2010-04-12 20:23                                                                                                                                                               ` [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma Linus Torvalds
  2010-04-12 21:03                                                                                                                                                                 ` Rik van Riel
@ 2010-04-13  0:41                                                                                                                                                                 ` Johannes Weiner
  2010-04-13  1:08                                                                                                                                                                   ` Linus Torvalds
  1 sibling, 1 reply; 242+ messages in thread
From: Johannes Weiner @ 2010-04-13  0:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Rik van Riel, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On Mon, Apr 12, 2010 at 01:23:50PM -0700, Linus Torvalds wrote:
> 
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Mon, 12 Apr 2010 12:44:29 -0700
> Subject: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma
> 
> Otherwise we might be mapping in a page in a new mapping, but that page
> (through the swapcache) would later be mapped into an old mapping too.
> The page->mapping must be the case that works for everybody, not just
> the mapping that happened to page it in first.
> 
> This can be improved in certain cases: if we know the page is private to
> just this particular mapping (for example, it's a new page, or it is the
> only swapcache entry), we could pick the top (most specific) anon_vma.
> 
> But that's a future optimization. Make it _work_ reliably first.
> 
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Would you mind pasting that nice description of the error case from your
other email into that changelog?  I skimmed over the description but when
I read this patch several hours later, I had to go back to that previous
email to fully make sense of it.

> ---
>  mm/rmap.c |   15 +++++++++++++--
>  1 files changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index ee97d38..4bad326 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -734,9 +734,20 @@ void page_move_anon_rmap(struct page *page,
>  static void __page_set_anon_rmap(struct page *page,
>  	struct vm_area_struct *vma, unsigned long address)
>  {
> -	struct anon_vma *anon_vma = vma->anon_vma;
> +	struct anon_vma_chain *avc;
> +	struct anon_vma *anon_vma;
> +
> +	BUG_ON(!vma->anon_vma);
> +
> +	/*
> +	 * We must use the _oldest_ possible anon_vma for the page mapping!

I think the key here is not just that it's the oldest (past) but also the one with
the longest extent (future), so that it's bound to stay until the last possible
mapping for this page vanishes.

Maybe it's just me, but I doubt the comment as it is would help me understand
that code if I didn't already.

> +	 *
> +	 * So take the last AVC chain entry in the vma, which is the deepest
> +	 * ancestor, and use the anon_vma from that.
> +	 */
> +	avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
> +	anon_vma = avc->anon_vma;
>  
> -	BUG_ON(!anon_vma);
>  	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
>  	page->mapping = (struct address_space *) anon_vma;
>  	page->index = linear_page_index(vma, address);

Hannes


* Re: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma
  2010-04-13  0:41                                                                                                                                                                 ` Johannes Weiner
@ 2010-04-13  1:08                                                                                                                                                                   ` Linus Torvalds
  2010-04-13  4:23                                                                                                                                                                     ` Minchan Kim
  0 siblings, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-13  1:08 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Borislav Petkov, Rik van Riel, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson



On Tue, 13 Apr 2010, Johannes Weiner wrote:
> 
> Would you mind pasting that nice description of the error case from your
> other email into that changelog?  I skimmed over the description but when
> I read this patch several hours later, I had to go back to that previous
> email to fully make sense of it.

It now looks like this..

		Linus
---
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 12 Apr 2010 12:44:29 -0700
Subject: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma

Otherwise we might be mapping in a page in a new mapping, but that page
(through the swapcache) would later be mapped into an old mapping too.
The page->mapping must be the case that works for everybody, not just
the mapping that happened to page it in first.

Here's the scenario:

 - page gets allocated/mapped by process A. Let's call the anon_vma we
   associate the page with 'A' to keep it easy to track.

 - Process A forks, creating process B. The anon_vma in B is 'B', and has
   a chain that looks like 'B' -> 'A'. Everything is fine.

 - Swapping happens. The page (with mapping pointing to 'A') gets swapped
   out (perhaps not to disk - it's enough to assume that it's just not
   mapped any more, and lives entirely in the swap-cache)

 - Process B pages it in, which goes like this:

        do_swap_page ->
          page = lookup_swap_cache(entry);
         ...
          set_pte_at(mm, address, page_table, pte);
          page_add_anon_rmap(page, vma, address);

   And think about what happens here!

   In particular, what happens is that this will now be the "first"
   mapping of that page, so page_add_anon_rmap() used to do

        if (first)
                __page_set_anon_rmap(page, vma, address);

   and notice what anon_vma it will use? It will use the anon_vma for
   process B!

   What happens then? Trivial: process 'A' also pages it in (nothing
   happens, it's not the first mapping), and then process 'B' execve's
   or exits or unmaps, making anon_vma B go away.

   End result: process A has a page that points to anon_vma B, but
   anon_vma B does not exist any more.  This can go on forever.  Forget
   about RCU grace periods, forget about locking, forget anything like
   that.  The bug is simply that page->mapping points to an anon_vma
   that was correct at one point, but was _not_ the one that was shared
   by all users of that possible mapping.

Changing it to always use the deepest anon_vma in the anonvma chain gets
us to the safest model.

This can be improved in certain cases: if we know the page is private to
just this particular mapping (for example, it's a new page, or it is the
only swapcache entry), we could pick the top (most specific) anon_vma.

But that's a future optimization. Make it _work_ reliably first.

Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Borislav Petkov <bp@alien8.de> [ "What do you know, I think you fixed it!" ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/rmap.c |   15 +++++++++++++--
 1 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index ee97d38..4bad326 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -734,9 +734,20 @@ void page_move_anon_rmap(struct page *page,
 static void __page_set_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address)
 {
-	struct anon_vma *anon_vma = vma->anon_vma;
+	struct anon_vma_chain *avc;
+	struct anon_vma *anon_vma;
+
+	BUG_ON(!vma->anon_vma);
+
+	/*
+	 * We must use the _oldest_ possible anon_vma for the page mapping!
+	 *
+	 * So take the last AVC chain entry in the vma, which is the deepest
+	 * ancestor, and use the anon_vma from that.
+	 */
+	avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
+	anon_vma = avc->anon_vma;
 
-	BUG_ON(!anon_vma);
 	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
 	page->mapping = (struct address_space *) anon_vma;
 	page->index = linear_page_index(vma, address);
-- 
1.7.1.rc1.dirty
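
The effect of taking the tail of the chain can be sketched in userspace.
What follows is a minimal model, not kernel code: the list helpers and the
structures are cut-down stand-ins written for this illustration, and
newest_anon_vma(), oldest_anon_vma() and oldest_anon_vma_demo() are
hypothetical names. It builds the 'B' -> 'A' chain from the scenario above
and shows which anon_vma each rule would pick.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

/* Insert right after the head, so new entries go to the front of the chain. */
static void list_add(struct list_head *item, struct list_head *head)
{
	item->next = head->next;
	item->prev = head;
	head->next->prev = item;
	head->next = item;
}

#define list_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Toy stand-ins: the real structures also carry locks and more links. */
struct anon_vma { const char *name; };
struct anon_vma_chain { struct anon_vma *anon_vma; struct list_head same_vma; };
struct vma { struct anon_vma *anon_vma; struct list_head anon_vma_chain; };

/* Rule before the patch: use the vma's own (most specific) anon_vma. */
const char *newest_anon_vma(struct vma *vma)
{
	return vma->anon_vma->name;
}

/* Rule after the patch: use the chain tail, i.e. the deepest ancestor. */
const char *oldest_anon_vma(struct vma *vma)
{
	struct anon_vma_chain *avc =
		list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);

	return avc->anon_vma->name;
}

/* Model process B's vma after a fork from A: the chain is 'B' -> 'A'. */
int oldest_anon_vma_demo(void)
{
	struct anon_vma a = { "A" }, b = { "B" };
	struct anon_vma_chain avc_a = { &a }, avc_b = { &b };
	struct vma vma_b = { &b };

	list_init(&vma_b.anon_vma_chain);
	list_add(&avc_a.same_vma, &vma_b.anon_vma_chain);	/* inherited from A */
	list_add(&avc_b.same_vma, &vma_b.anon_vma_chain);	/* B's own, at head */

	assert(strcmp(newest_anon_vma(&vma_b), "B") == 0);
	assert(strcmp(oldest_anon_vma(&vma_b), "A") == 0);
	return 0;
}
```

In this model the page faulted in by B ends up pointing at 'A', which
outlives B's exit, instead of at 'B'.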


^ permalink raw reply related	[flat|nested] 242+ messages in thread

* Re: [PATCH 1/4] Simplify and comment on anon_vma re-use for  anon_vma_prepare()
  2010-04-12 20:22                                                                                                                                                         ` [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare() Linus Torvalds
                                                                                                                                                                             ` (2 preceding siblings ...)
  2010-04-12 23:54                                                                                                                                                           ` Johannes Weiner
@ 2010-04-13  4:04                                                                                                                                                           ` Minchan Kim
  2010-04-13  9:51                                                                                                                                                           ` Peter Zijlstra
  4 siblings, 0 replies; 242+ messages in thread
From: Minchan Kim @ 2010-04-13  4:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Rik van Riel, Johannes Weiner, KOSAKI Motohiro,
	Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On Tue, Apr 13, 2010 at 5:22 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Sat, 10 Apr 2010 10:36:19 -0700
> Subject: [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare()
>
> This changes the anon_vma reuse case to require that we only reuse
> simple anon_vma's - ie the case when the vma only has a single anon_vma
> associated with it.
>
> This means that a reuse of an anon_vma from an adjacent vma will always
> guarantee that both vma's are associated not only with the same
> anon_vma, they will also have the same anon_vma chain (of just a single
> entry in this case).
>
> And since anon_vma re-use was the only case where the same anon_vma
> might be associated with different chains of anon_vma's, we now have the
> case that every vma that shares the same vma will always also have the

same vma => same anon_vma.

> same chain.  That makes it much easier to think about merging vma's that
> share the same anon_vma's: you can always just drop the other anon_vma
> chain in anon_vma_merge() since you know that they are always identical.
>
> This also splits up the function to validate the anon_vma re-use, and
> adds a lot of commentary about the possible races.
>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains
  2010-04-12 20:23                                                                                                                                                           ` [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains Linus Torvalds
                                                                                                                                                                               ` (2 preceding siblings ...)
  2010-04-12 23:59                                                                                                                                                             ` Johannes Weiner
@ 2010-04-13  4:15                                                                                                                                                             ` Minchan Kim
  3 siblings, 0 replies; 242+ messages in thread
From: Minchan Kim @ 2010-04-13  4:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Rik van Riel, Johannes Weiner, KOSAKI Motohiro,
	Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On Tue, Apr 13, 2010 at 5:23 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Sat, 10 Apr 2010 15:22:30 -0700
> Subject: [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains
>
> When we move the boundaries between two vma's due to things like
> mprotect, we need to make sure that the anon_vma of the pages that got
> moved from one vma to another gets properly copied around.  And that was
> not always the case, in this rather hard-to-follow code sequence.
>
> Clarify the code, and fix it so that it copies the anon_vma from the
> right source.
>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order
  2010-04-12 20:23                                                                                                                                                             ` [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order Linus Torvalds
                                                                                                                                                                                 ` (2 preceding siblings ...)
  2010-04-13  0:18                                                                                                                                                               ` Johannes Weiner
@ 2010-04-13  4:16                                                                                                                                                               ` Minchan Kim
  3 siblings, 0 replies; 242+ messages in thread
From: Minchan Kim @ 2010-04-13  4:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Rik van Riel, Johannes Weiner, KOSAKI Motohiro,
	Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On Tue, Apr 13, 2010 at 5:23 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Sun, 11 Apr 2010 17:15:03 -0700
> Subject: [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order
>
> We want to walk the chain in reverse order when cloning it, so that the
> order of the result chain will be the same as the order in the source
> chain.  When we add entries to the chain, they go at the head of the
> chain, so we want to add the source head last.
>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>


-- 
Kind regards,
Minchan Kim
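
The ordering argument in patch 3/4 can be checked with a small userspace
model. This is only a sketch under stated assumptions: the list helpers
mimic the kernel's list_head, and struct node, clone_forward(),
clone_reverse(), same_order() and clone_order_demo() are hypothetical names
invented for the illustration. Since every copy is inserted at the head of
the destination, walking the source from head to tail reverses the order,
while walking it from tail to head preserves it.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

/* Insert right after the head, as the chain code does for new entries. */
static void list_add(struct list_head *item, struct list_head *head)
{
	item->next = head->next;
	item->prev = head;
	head->next->prev = item;
	head->next = item;
}

struct node { int id; struct list_head link; };

#define node_of(ptr) \
	((struct node *)((char *)(ptr) - offsetof(struct node, link)))

/* Walk src head-to-tail; head insertion makes dst the reverse of src. */
void clone_forward(struct list_head *dst, struct list_head *src,
		   struct node *storage)
{
	struct list_head *p;

	for (p = src->next; p != src; p = p->next, storage++) {
		storage->id = node_of(p)->id;
		list_add(&storage->link, dst);
	}
}

/* Walk src tail-to-head; the source head is added last and stays first. */
void clone_reverse(struct list_head *dst, struct list_head *src,
		   struct node *storage)
{
	struct list_head *p;

	for (p = src->prev; p != src; p = p->prev, storage++) {
		storage->id = node_of(p)->id;
		list_add(&storage->link, dst);
	}
}

/* Return 1 if both lists hold the same ids in the same order. */
int same_order(struct list_head *x, struct list_head *y)
{
	struct list_head *p = x->next, *q = y->next;

	while (p != x && q != y) {
		if (node_of(p)->id != node_of(q)->id)
			return 0;
		p = p->next;
		q = q->next;
	}
	return p == x && q == y;
}

int clone_order_demo(void)
{
	struct node src_nodes[3], fwd_nodes[3], rev_nodes[3];
	struct list_head src, fwd, rev;
	int i;

	list_init(&src);
	list_init(&fwd);
	list_init(&rev);

	/* Adding 0, 1, 2 at the head yields the chain 2 -> 1 -> 0. */
	for (i = 0; i < 3; i++) {
		src_nodes[i].id = i;
		list_add(&src_nodes[i].link, &src);
	}

	clone_forward(&fwd, &src, fwd_nodes);
	clone_reverse(&rev, &src, rev_nodes);

	assert(!same_order(&src, &fwd));	/* forward walk reversed it */
	assert(same_order(&src, &rev));		/* reverse walk kept the order */
	return 0;
}
```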

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH 4/4] anonvma: when setting up page->mapping, we need to  pick the _oldest_ anonvma
  2010-04-13  1:08                                                                                                                                                                   ` Linus Torvalds
@ 2010-04-13  4:23                                                                                                                                                                     ` Minchan Kim
  2010-04-13  4:26                                                                                                                                                                       ` Minchan Kim
  0 siblings, 1 reply; 242+ messages in thread
From: Minchan Kim @ 2010-04-13  4:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, Borislav Petkov, Rik van Riel, KOSAKI Motohiro,
	Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On Tue, Apr 13, 2010 at 10:08 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>
> On Tue, 13 Apr 2010, Johannes Weiner wrote:
>>
>> Would you mind pasting that nice description of the error case from your
>> other email into that changelog?  I skimmed over the description but when
>> I read this patch several hours later, I had to go back to that previous
>> email to fully make sense of it.
>
> It now looks like this..
>
>                Linus
> ---
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Mon, 12 Apr 2010 12:44:29 -0700
> Subject: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma
>
> Otherwise we might be mapping in a page in a new mapping, but that page
> (through the swapcache) would later be mapped into an old mapping too.
> The page->mapping must be the case that works for everybody, not just
> the mapping that happened to page it in first.
>
> Here's the scenario:
>
>  - page gets allocated/mapped by process A. Let's call the anon_vma we
>   associate the page with 'A' to keep it easy to track.
>
>  - Process A forks, creating process B. The anon_vma in B is 'B', and has
>   a chain that looks like 'B' -> 'A'. Everything is fine.
>
>  - Swapping happens. The page (with mapping pointing to 'A') gets swapped
>   out (perhaps not to disk - it's enough to assume that it's just not
>   mapped any more, and lives entirely in the swap-cache)
>
>  - Process B pages it in, which goes like this:
>
>        do_swap_page ->
>          page = lookup_swap_cache(entry);
>         ...
>          set_pte_at(mm, address, page_table, pte);
>          page_add_anon_rmap(page, vma, address);
>
>   And think about what happens here!
>
>   In particular, what happens is that this will now be the "first"
>   mapping of that page, so page_add_anon_rmap() used to do
>
>        if (first)
>                __page_set_anon_rmap(page, vma, address);
>
>   and notice what anon_vma it will use? It will use the anon_vma for
>   process B!
>
>   What happens then? Trivial: process 'A' also pages it in (nothing
>   happens, it's not the first mapping), and then process 'B' execve's
>   or exits or unmaps, making anon_vma B go away.
>
>   End result: process A has a page that points to anon_vma B, but
>   anon_vma B does not exist any more.  This can go on forever.  Forget
>   about RCU grace periods, forget about locking, forget anything like
>   that.  The bug is simply that page->mapping points to an anon_vma
>   that was correct at one point, but was _not_ the one that was shared
>   by all users of that possible mapping.
>
> Changing it to always use the deepest anon_vma in the anonvma chain gets
> us to the safest model.
>
> This can be improved in certain cases: if we know the page is private to
> just this particular mapping (for example, it's a new page, or it is the
> only swapcache entry), we could pick the top (most specific) anon_vma.
>
> But that's a future optimization. Make it _work_ reliably first.
>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Tested-by: Borislav Petkov <bp@alien8.de> [ "What do you know, I think you fixed it!" ]
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Minchan Kim <minchan.kim>

It was a great hunt, and a chance to learn many things
from the smart people on LKML.
Once again I feel the power of OSS and the great process by which Linux evolves.

Thanks, everybody.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH 4/4] anonvma: when setting up page->mapping, we need to  pick the _oldest_ anonvma
  2010-04-13  4:23                                                                                                                                                                     ` Minchan Kim
@ 2010-04-13  4:26                                                                                                                                                                       ` Minchan Kim
  0 siblings, 0 replies; 242+ messages in thread
From: Minchan Kim @ 2010-04-13  4:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, Borislav Petkov, Rik van Riel, KOSAKI Motohiro,
	Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On Tue, Apr 13, 2010 at 1:23 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
> On Tue, Apr 13, 2010 at 10:08 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>>
>> On Tue, 13 Apr 2010, Johannes Weiner wrote:
>>>
>>> Would you mind pasting that nice description of the error case from your
>>> other email into that changelog?  I skimmed over the description but when
>>> I read this patch several hours later, I had to go back to that previous
>>> email to fully make sense of it.
>>
>> It now looks like this..
>>
>>                Linus
>> ---
>> From: Linus Torvalds <torvalds@linux-foundation.org>
>> Date: Mon, 12 Apr 2010 12:44:29 -0700
>> Subject: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma
>>
>> Otherwise we might be mapping in a page in a new mapping, but that page
>> (through the swapcache) would later be mapped into an old mapping too.
>> The page->mapping must be the case that works for everybody, not just
>> the mapping that happened to page it in first.
>>
>> Here's the scenario:
>>
>>  - page gets allocated/mapped by process A. Let's call the anon_vma we
>>   associate the page with 'A' to keep it easy to track.
>>
>>  - Process A forks, creating process B. The anon_vma in B is 'B', and has
>>   a chain that looks like 'B' -> 'A'. Everything is fine.
>>
>>  - Swapping happens. The page (with mapping pointing to 'A') gets swapped
>>   out (perhaps not to disk - it's enough to assume that it's just not
>>   mapped any more, and lives entirely in the swap-cache)
>>
>>  - Process B pages it in, which goes like this:
>>
>>        do_swap_page ->
>>          page = lookup_swap_cache(entry);
>>         ...
>>          set_pte_at(mm, address, page_table, pte);
>>          page_add_anon_rmap(page, vma, address);
>>
>>   And think about what happens here!
>>
>>   In particular, what happens is that this will now be the "first"
>>   mapping of that page, so page_add_anon_rmap() used to do
>>
>>        if (first)
>>                __page_set_anon_rmap(page, vma, address);
>>
>>   and notice what anon_vma it will use? It will use the anon_vma for
>>   process B!
>>
>>   What happens then? Trivial: process 'A' also pages it in (nothing
>>   happens, it's not the first mapping), and then process 'B' execve's
>>   or exits or unmaps, making anon_vma B go away.
>>
>>   End result: process A has a page that points to anon_vma B, but
>>   anon_vma B does not exist any more.  This can go on forever.  Forget
>>   about RCU grace periods, forget about locking, forget anything like
>>   that.  The bug is simply that page->mapping points to an anon_vma
>>   that was correct at one point, but was _not_ the one that was shared
>>   by all users of that possible mapping.
>>
>> Changing it to always use the deepest anon_vma in the anonvma chain gets
>> us to the safest model.
>>
>> This can be improved in certain cases: if we know the page is private to
>> just this particular mapping (for example, it's a new page, or it is the
>> only swapcache entry), we could pick the top (most specific) anon_vma.
>>
>> But that's a future optimization. Make it _work_ reliably first.
>>
>> Reviewed-by: Rik van Riel <riel@redhat.com>
>> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>> Tested-by: Borislav Petkov <bp@alien8.de> [ "What do you know, I think you fixed it!" ]
>> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

Sorry for the mistake.
I was extremely excited. :)

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-12 22:11                                                                                                                                                     ` Linus Torvalds
  2010-04-12 22:18                                                                                                                                                       ` Linus Torvalds
@ 2010-04-13  9:38                                                                                                                                                       ` Borislav Petkov
  2010-04-14 21:59                                                                                                                                                         ` [PATCH] rmap: add exclusively owned pages to the newest anon_vma Rik van Riel
  1 sibling, 1 reply; 242+ messages in thread
From: Borislav Petkov @ 2010-04-13  9:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, Apr 12, 2010 at 03:11:53PM -0700

> Ok. That does sound very positive. Of course, last time you sounded 
> positive, I had an email from you half an hour later that said "oh no, it 
> oopsed again". So I'll take it with a bit of salt, but on the whole I'll 
> be optimistic about it.

Ok, just finished testing -rc4 - no problems so far. Let's just go out
on a limb here and say with a greater certainty that this really got
fixed but be smart about it and keep an eye open if it happens again -
you never know.

Where is the champagne?

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare()
  2010-04-12 20:22                                                                                                                                                         ` [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare() Linus Torvalds
                                                                                                                                                                             ` (3 preceding siblings ...)
  2010-04-13  4:04                                                                                                                                                           ` Minchan Kim
@ 2010-04-13  9:51                                                                                                                                                           ` Peter Zijlstra
  4 siblings, 0 replies; 242+ messages in thread
From: Peter Zijlstra @ 2010-04-13  9:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Rik van Riel, Johannes Weiner, KOSAKI Motohiro,
	Andrew Morton, Minchan Kim, Linux Kernel Mailing List,
	Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins,
	sgunderson

On Mon, 2010-04-12 at 13:22 -0700, Linus Torvalds wrote:
> +static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *b)
> +{
> +       return a->vm_end == b->vm_start &&
> +               mpol_equal(vma_policy(a), vma_policy(b)) &&
> +               a->vm_file == b->vm_file &&
> +               !((a->vm_flags ^ b->vm_flags) & ~(VM_READ|VM_WRITE|VM_EXEC)) &&
> +               b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT);
> +} 

Maybe write that as:

static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *b)
{
	if (a->vm_end != b->vm_start)
		return 0;

	if (!mpol_equal(vma_policy(a), vma_policy(b)))
		return 0;

	if (a->vm_file != b->vm_file)
		return 0;

	if ((a->vm_flags ^ b->vm_flags) & ~(VM_READ|VM_WRITE|VM_EXEC))
		return 0;

	if (a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT) != b->vm_pgoff)
		return 0;

	return 1;
}
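
To see what the individual checks accept and reject, here is a userspace
sketch of the same early-return form. It is an illustration only: struct
toy_vma, toy_anon_vma_compatible() and compatible_demo() are hypothetical,
the flag values and PAGE_SHIFT of 12 are assumptions chosen for the
example, and the mpol_equal() policy check is left out because it has no
userspace equivalent.

```c
#include <assert.h>

/* Assumed values for the illustration, not taken from kernel headers. */
#define PAGE_SHIFT	12
#define VM_READ		0x1UL
#define VM_WRITE	0x2UL
#define VM_EXEC		0x4UL
#define VM_SHARED	0x8UL

/* Cut-down stand-in for vm_area_struct with only the fields being compared. */
struct toy_vma {
	unsigned long vm_start, vm_end;	/* byte addresses, end exclusive */
	unsigned long vm_pgoff;		/* offset in pages */
	unsigned long vm_flags;
	const void *vm_file;
};

/* 'a' must end exactly where 'b' starts, and everything else must line up. */
int toy_anon_vma_compatible(const struct toy_vma *a, const struct toy_vma *b)
{
	if (a->vm_end != b->vm_start)
		return 0;

	if (a->vm_file != b->vm_file)
		return 0;

	/* Only the read/write/exec bits are allowed to differ. */
	if ((a->vm_flags ^ b->vm_flags) & ~(VM_READ | VM_WRITE | VM_EXEC))
		return 0;

	/* b's page offset must continue a's linearly. */
	if (a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT) != b->vm_pgoff)
		return 0;

	return 1;
}

int compatible_demo(void)
{
	/* a covers two pages at pgoff 0x10, so b must start at pgoff 0x12. */
	struct toy_vma a = { 0x10000, 0x12000, 0x10, VM_READ | VM_WRITE, 0 };
	struct toy_vma b = { 0x12000, 0x13000, 0x12, VM_READ, 0 };
	struct toy_vma gap = b, shared = b, skew = b;

	gap.vm_start = 0x13000;		/* no longer adjacent to a */
	gap.vm_end = 0x14000;
	gap.vm_pgoff = 0x13;
	shared.vm_flags |= VM_SHARED;	/* differs outside R/W/X */
	skew.vm_pgoff = 0x20;		/* offset does not continue a's */

	assert(toy_anon_vma_compatible(&a, &b) == 1);
	assert(toy_anon_vma_compatible(&a, &gap) == 0);
	assert(toy_anon_vma_compatible(&a, &shared) == 0);
	assert(toy_anon_vma_compatible(&a, &skew) == 0);
	return 0;
}
```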



^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-12 15:51                                                                                                                                         ` Linus Torvalds
@ 2010-04-13 10:36                                                                                                                                           ` KOSAKI Motohiro
  0 siblings, 0 replies; 242+ messages in thread
From: KOSAKI Motohiro @ 2010-04-13 10:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kosaki.motohiro, Rik van Riel, Borislav Petkov, Johannes Weiner,
	Andrew Morton, Minchan Kim, Linux Kernel Mailing List,
	Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins,
	sgunderson

Hi Linus,

> On Sun, 11 Apr 2010, Rik van Riel wrote:
> >
> > Another thing I just thought of.
> > 
> > The anon_vma struct will not be reused for something completely
> > different due to the SLAB_DESTROY_BY_RCU flag that the anon_vma_cachep
> > is created with.
> 
> Rik, we _know_ it got re-used by something totally different. That's 
> clearly the problem. The page->mapping pointer does _not_ point to an 
> anon_vma any more. That's the problem here.
> 
> What we need to figure out is how we have a page on the LRU list that is 
> still marked as 'mapped' that has that stale mapping pointer.
> 
> I can easily see how the stale mapping pointer happens for a non-mapped 
> page. That part is trivial. Here's a simple case:
> 
>  - vmscan does that whole "isolate LRU pages", and one of them is a (at 
>    that time mapped) anonymous page. It's now not on any LRU lists at all.
> 
>  - vmscan ends up waiting for pageout and/or writeback while holding that 
>    list of pages.
> 
>  - in the meantime, the process that had the page exits or unmaps, 
>    unmapping the page and freeing the vma and the anon_vma.
> 
>  - vmscan eventually gets to the page, and does that page_referenced() 
>    dance. page->mapping points to something that is long long gone (as in 
>    "IO access lifetimes", so we're talking something that has been freed 
>    literally milliseconds ago, rather than any RCU delays)
> 
> So I can see the stale page->mapping pointer happening. That part is even 
> trivial. What I don't see is how the page would be still marked 'mapped'. 
> Everything that actually frees the vma/anon_vmas should also have 
> unmapped the page before that - even if it didn't _free_ the page.

Sorry, I have lost track of what is being discussed in this crazy long thread.
IIUC, if the anon_vma behind page->mapping was freed milliseconds ago, the
page_mapped() check at (1) below fails and we never dereference page->mapping.

Am I missing something?


===================================================================
struct anon_vma *page_lock_anon_vma(struct page *page)
{
        struct anon_vma *anon_vma;
        unsigned long anon_mapping;

        rcu_read_lock();
        anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
        if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
                goto out;
        if (!page_mapped(page))       /* (1) here */ 
                goto out;

        anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
        spin_lock(&anon_vma->lock);
        return anon_vma;
out:
        rcu_read_unlock();
        return NULL;
}
=================================================
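
The order of the checks in that function can be modelled in userspace. This
is a sketch under stated assumptions, not the kernel implementation: the
toy_* names are hypothetical, the flag values are assumed for the
illustration, and a plain counter stands in for the kernel's _mapcount
(which is biased to start at -1). The point is that both the tag check and
the mapped check happen before the pointer is handed out for dereferencing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed tag values; the low bits of page->mapping encode the type. */
#define PAGE_MAPPING_ANON	1UL
#define PAGE_MAPPING_KSM	2UL
#define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM)

struct toy_anon_vma { long lock; };		/* aligned, so low bits are free */
struct toy_page { void *mapping; int mapcount; };

/* Store the anon_vma pointer with the ANON tag in its low bit. */
void toy_set_anon(struct toy_page *page, struct toy_anon_vma *av)
{
	page->mapping = (void *)((uintptr_t)av + PAGE_MAPPING_ANON);
}

/* Return the anon_vma only for a tagged and currently mapped page. */
struct toy_anon_vma *toy_page_anon_vma(const struct toy_page *page)
{
	uintptr_t m = (uintptr_t)page->mapping;

	if ((m & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
		return NULL;		/* not an anonymous mapping */
	if (page->mapcount == 0)
		return NULL;		/* check (1): page is not mapped */

	return (struct toy_anon_vma *)(m - PAGE_MAPPING_ANON);
}

int tagging_demo(void)
{
	struct toy_anon_vma av = { 0 };
	struct toy_page page = { NULL, 0 };

	assert(toy_page_anon_vma(&page) == NULL);	/* no mapping at all */

	toy_set_anon(&page, &av);
	assert(toy_page_anon_vma(&page) == NULL);	/* tagged but unmapped */

	page.mapcount = 1;
	assert(toy_page_anon_vma(&page) == &av);	/* tagged and mapped */

	page.mapping = &av;				/* untagged pointer */
	assert(toy_page_anon_vma(&page) == NULL);
	return 0;
}
```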


Also, I think your patch below is incorrect.
The added page_mapped() check runs after spin_lock(&anon_vma->lock),
which makes it a check-after-dereference. Such a check cannot prevent an
invalid pointer dereference, I think.

Perhaps I'm missing something. I will have to reread this whole thread from
the beginning.

---
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -302,7 +302,11 @@ struct anon_vma *page_lock_anon_vma(struct page *page)
 
 	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
 	spin_lock(&anon_vma->lock);
-	return anon_vma;
+
+	if (page_mapped(page))
+		return anon_vma;
+
+	spin_unlock(&anon_vma->lock);
 out:
 	rcu_read_unlock();
 	return NULL;








^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-12 16:01                                                                                                                   ` Peter Zijlstra
  2010-04-12 16:06                                                                                                                     ` Rik van Riel
@ 2010-04-13 10:53                                                                                                                     ` KOSAKI Motohiro
  2010-04-13 11:30                                                                                                                       ` Peter Zijlstra
  1 sibling, 1 reply; 242+ messages in thread
From: KOSAKI Motohiro @ 2010-04-13 10:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kosaki.motohiro, Rik van Riel, Linus Torvalds, Borislav Petkov,
	Johannes Weiner, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson

>  struct anon_vma *page_lock_anon_vma(struct page *page)
>  {
> @@ -294,14 +309,24 @@ struct anon_vma *page_lock_anon_vma(struct page *page)
>  	unsigned long anon_mapping;
>  
>  	rcu_read_lock();
> -	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
> +	anon_mapping = (unsigned long)rcu_dereference(page->mapping);
>  	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
>  		goto out;
> -	if (!page_mapped(page))
> -		goto out;
>  
>  	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
>  	spin_lock(&anon_vma->lock);

Is the anon_vma->lock dereference guaranteed to be safe if page->_mapcount == -1?
The anon_vma may have been freed milliseconds ago; rcu_read_lock() doesn't
provide such a guarantee.

Perhaps I'm missing your point.



^ permalink raw reply	[flat|nested] 242+ messages in thread

* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-13 10:53                                                                                                                     ` KOSAKI Motohiro
@ 2010-04-13 11:30                                                                                                                       ` Peter Zijlstra
  2010-04-13 12:00                                                                                                                         ` KOSAKI Motohiro
  0 siblings, 1 reply; 242+ messages in thread
From: Peter Zijlstra @ 2010-04-13 11:30 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Rik van Riel, Linus Torvalds, Borislav Petkov, Johannes Weiner,
	Andrew Morton, Minchan Kim, Linux Kernel Mailing List,
	Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins,
	sgunderson

On Tue, 2010-04-13 at 19:53 +0900, KOSAKI Motohiro wrote:
> >  struct anon_vma *page_lock_anon_vma(struct page *page)
> >  {
> > @@ -294,14 +309,24 @@ struct anon_vma *page_lock_anon_vma(struct page *page)
> >  	unsigned long anon_mapping;
> >  
> >  	rcu_read_lock();
> > -	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
> > +	anon_mapping = (unsigned long)rcu_dereference(page->mapping);
> >  	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> >  		goto out;
> > -	if (!page_mapped(page))
> > -		goto out;
> >  
> >  	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
> >  	spin_lock(&anon_vma->lock);
> 
> Is the anon_vma->lock dereference guaranteed to be safe if page->_mapcount == -1?
> The anon_vma may have been freed milliseconds ago; rcu_read_lock() doesn't
> provide such a guarantee.
> 
> Perhaps I'm missing your point.

No you're right, I got my head hopelessly twisted up trying to make
page_lock_anon_vma() do something reliable, but there really isn't much
that can be done.

Luckily most users (with exception of the memory-failure.c one) don't
really care and all take steps to verify the page is indeed in any of
the vmas it might find.

So I've given up on this and will only submit a patch like the below,
which hopefully does still make sense...

I do think there's a missing barrier in there as well, but I've made
enough of a fool of myself.

[ with the preemptible mmu_gather patches I introduce a refcount to
  the anon_vma, and then with atomic_inc_not_zero() we can add a
  guarantee that the returned anon_vma is alive ]
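
For readers unfamiliar with the atomic_inc_not_zero() pattern mentioned in
the note above, here is a minimal userspace sketch using C11 atomics. It is
not the kernel implementation: struct obj, obj_get_not_zero() and obj_put()
are hypothetical names, and the sketch assumes the memory holding the counter
stays readable even after the count has dropped to zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for an object carrying a refcount. */
struct obj {
	atomic_int refcount;	/* 0 means the object is dead or being freed */
};

/*
 * Take a reference only if the count is still non-zero.
 * Returns true on success, false if the object was already dead.
 */
static bool obj_get_not_zero(struct obj *o)
{
	int old;

	assert(o != NULL);
	old = atomic_load(&o->refcount);
	while (old != 0) {
		/* On failure 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true when the last reference went away. */
static bool obj_put(struct obj *o)
{
	assert(o != NULL);
	return atomic_fetch_sub(&o->refcount, 1) == 1;
}
```

The point of the compare-and-exchange loop is that a lookup racing with the
final put can never resurrect an object: once the count reaches zero, every
later obj_get_not_zero() fails and the caller has to treat the pointer as
stale.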

---
 mm/rmap.c |   18 ++++++++++++++++--
 1 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index eaa7a09..49a2533 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -285,8 +285,22 @@ void __init anon_vma_init(void)
 }
 
 /*
- * Getting a lock on a stable anon_vma from a page off the LRU is
- * tricky: page_lock_anon_vma rely on RCU to guard against the races.
+ * Getting a lock on a stable anon_vma from a page off the LRU is tricky!
+ *
+ * Since there is no serialization what so ever against page_remove_rmap()
+ * the best this function can do is return a locked anon_vma that might
+ * have been relevant to this page.
+ *
+ * The page might have been remapped to a different anon_vma or the anon_vma
+ * returned may already be freed (and even reused).
+ *
+ * All users of this function must be very careful when walking the anon_vma
+ * chain and verify that the page in question is indeed mapped in it
+ * [ something equivalent to page_mapped_in_vma() ].
+ *
+ * Since anon_vma's slab is DESTROY_BY_RCU and we know from page_remove_rmap()
+ * that the anon_vma pointer from page->mapping is valid if there is a
+ * mapcount, we can dereference the anon_vma after observing those.
  */
 struct anon_vma *page_lock_anon_vma(struct page *page)
 {



* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-13 11:30                                                                                                                       ` Peter Zijlstra
@ 2010-04-13 12:00                                                                                                                         ` KOSAKI Motohiro
  2010-04-14 14:27                                                                                                                           ` Peter Zijlstra
  0 siblings, 1 reply; 242+ messages in thread
From: KOSAKI Motohiro @ 2010-04-13 12:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kosaki.motohiro, Rik van Riel, Linus Torvalds, Borislav Petkov,
	Johannes Weiner, Andrew Morton, Minchan Kim,
	Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
	Andrea Arcangeli, Hugh Dickins, sgunderson

> On Tue, 2010-04-13 at 19:53 +0900, KOSAKI Motohiro wrote:
> > >  struct anon_vma *page_lock_anon_vma(struct page *page)
> > >  {
> > > @@ -294,14 +309,24 @@ struct anon_vma *page_lock_anon_vma(struct page *page)
> > >  	unsigned long anon_mapping;
> > >  
> > >  	rcu_read_lock();
> > > -	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
> > > +	anon_mapping = (unsigned long)rcu_dereference(page->mapping);
> > >  	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> > >  		goto out;
> > > -	if (!page_mapped(page))
> > > -		goto out;
> > >  
> > >  	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
> > >  	spin_lock(&anon_vma->lock);
> > 
> > Is the anon->lock dereference guaranteed to be safe if page->_mapcount==-1?
> > It may have been freed milliseconds ago; rcu_read_lock() doesn't provide
> > such a guarantee.
> > 
> > perhaps, I'm missing your point.
> 
> No you're right, I got my head hopelessly twisted up trying to make
> page_lock_anon_vma() do something reliable, but there really isn't much
> that can be done.
> 
> Luckily most users (with exception of the memory-failure.c one) don't
> really care and all take steps to verify the page is indeed in any of
> the vmas it might find.
> 
> So I've given up on this and will only submit a patch like the below,
> which hopefully does still make sense...
> 
> I do think there's a missing barrier in there as well, but I've made
> enough of a fool of myself.
> 
> [ with the preemptible mmu_gather patches I introduce a refcount to
>   the anon_vma, and then with atomic_inc_not_zero() we can add a
>   guarantee that the returned anon_vma is alive ]

Indeed, a refcount is the best way. The anon_vma DESTROY_BY_RCU stuff seems
overengineered, I think. It is the fastest, but anon_vma allocation is not
(and was not) a fork/exit bottleneck. So I guess the simplest way is
best.


Also, the following patch looks good to me.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Thanks for that. I had thought this was really necessary, but my (very) poor
English skills made me hesitate. Sorry for my laziness ;)



> 
> ---
>  mm/rmap.c |   18 ++++++++++++++++--
>  1 files changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index eaa7a09..49a2533 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -285,8 +285,22 @@ void __init anon_vma_init(void)
>  }
>  
>  /*
> - * Getting a lock on a stable anon_vma from a page off the LRU is
> - * tricky: page_lock_anon_vma rely on RCU to guard against the races.
> + * Getting a lock on a stable anon_vma from a page off the LRU is tricky!
> + *
> + * Since there is no serialization what so ever against page_remove_rmap()
> + * the best this function can do is return a locked anon_vma that might
> + * have been relevant to this page.
> + *
> + * The page might have been remapped to a different anon_vma or the anon_vma
> + * returned may already be freed (and even reused).
> + *
> + * All users of this function must be very careful when walking the anon_vma
> + * chain and verify that the page in question is indeed mapped in it
> + * [ something equivalent to page_mapped_in_vma() ].
> + *
> + * Since anon_vma's slab is DESTROY_BY_RCU and we know from page_remove_rmap()
> + * that the anon_vma pointer from page->mapping is valid if there is a
> + * mapcount, we can dereference the anon_vma after observing those.
>   */
>  struct anon_vma *page_lock_anon_vma(struct page *page)
>  {
> 





* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-13 12:00                                                                                                                         ` KOSAKI Motohiro
@ 2010-04-14 14:27                                                                                                                           ` Peter Zijlstra
  0 siblings, 0 replies; 242+ messages in thread
From: Peter Zijlstra @ 2010-04-14 14:27 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Rik van Riel, Linus Torvalds, Borislav Petkov, Johannes Weiner,
	Andrew Morton, Minchan Kim, Linux Kernel Mailing List,
	Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins,
	sgunderson

On Tue, 2010-04-13 at 21:00 +0900, KOSAKI Motohiro wrote:
> > [ with the preemptible mmu_gather patches I introduce a refcount to
> >   the anon_vma, and then with atomic_inc_not_zero() we can add a
> >   guarantee that the returned anon_vma is alive ]
> 
> Indeed, a refcount is the best way. The anon_vma DESTROY_BY_RCU stuff seems
> overengineered, I think. It is the fastest, but anon_vma allocation is not
> (and was not) a fork/exit bottleneck. So I guess the simplest way is
> best.

Well, that refcount stuff still relies on DESTROY_BY_RCU :-)

Anyway, it also looks like a lot of races are avoided by ordering the
rmap_add/remove calls wrt adding/removing the page to/from the LRU.

Rmap calls come from LRU pages, and it looks like rmap state is only
changed for pages that are not on the LRU.

I still have to go through all that code again to make sure, but I
couldn't find a race between page_add_anon_rmap() and
page_lock_anon_vma() due to that.

If there is, we need to look at page_mapped() before page->mapping
because page_add_anon_rmap() first increments the mapcount and only then
adjusts the mapping, so the existing order in page_lock_anon_vma() can
end up dereferencing a long-dead anon_vma.








* [PATCH] rmap: add exclusively owned pages to the newest anon_vma
  2010-04-13  9:38                                                                                                                                                       ` Borislav Petkov
@ 2010-04-14 21:59                                                                                                                                                         ` Rik van Riel
  2010-04-14 23:20                                                                                                                                                           ` Johannes Weiner
                                                                                                                                                                             ` (3 more replies)
  0 siblings, 4 replies; 242+ messages in thread
From: Rik van Riel @ 2010-04-14 21:59 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Linus Torvalds, Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

The recent anon_vma fixes cause many anonymous pages to end up
in the parent process anon_vma, even when the page is exclusively
owned by the current process.

Adding exclusively owned anonymous pages to the top anon_vma
reduces rmap scanning overhead, especially in workloads with
forking servers.

This patch adds a parameter to __page_set_anon_rmap that can
be used to indicate whether or not the added page is exclusively
owned by the current process.

Pages added through page_add_new_anon_rmap are exclusively
owned by the current process, and can be added to the top
anon_vma.

Pages added through page_add_anon_rmap can be either shared
or exclusively owned, so we do the conservative thing and
add them to the oldest anon_vma.

A next step would be to add the exclusive parameter to
page_add_anon_rmap, to be used from functions where we do
know for sure whether a page is exclusively owned.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
Borislav, I audited the code before making this change, but would
still appreciate your testing of this patch :)

Linus, once this patch survives Borislav's testing, I'll start
looking at the next step. I'd like to do things one step at a
time so I won't cause another regression...
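
The selection rule described above can be modelled in a few lines of plain
C. This is only a sketch under stated assumptions: struct vma_model, its
array-based chain (newest entry first, oldest ancestor last) and
pick_anon_vma() are hypothetical stand-ins, not the kernel's anon_vma_chain
list or the patched __page_set_anon_rmap().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the kernel structures. */
struct anon_vma {
	int id;
};

struct vma_model {
	struct anon_vma *anon_vma;	/* newest: this process's own anon_vma */
	struct anon_vma **chain;	/* chain[0] newest ... chain[len-1] oldest */
	size_t chain_len;
};

/*
 * Exclusively owned pages go to the vma's own (newest) anon_vma.
 * Possibly shared pages go to the oldest ancestor, which every process
 * that could share the page is assumed to have in its chain.
 */
static struct anon_vma *pick_anon_vma(const struct vma_model *vma,
				      int exclusive)
{
	assert(vma != NULL && vma->anon_vma != NULL && vma->chain_len > 0);

	if (exclusive)
		return vma->anon_vma;
	return vma->chain[vma->chain_len - 1];
}
```

With a child vma whose chain is { child, parent }, an exclusive page lands
in the child's anon_vma and a possibly shared one in the parent's, so an
rmap walk for the exclusive page would not have to visit the vmas hanging
off the parent.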

 mm/rmap.c |   30 +++++++++++++++++++-----------
 1 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 4bad326..12ac0f1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -730,23 +730,31 @@ void page_move_anon_rmap(struct page *page,
  * @page:	the page to add the mapping to
  * @vma:	the vm area in which the mapping is added
  * @address:	the user virtual address mapped
+ * @exclusive:  the page is exclusively owned by the current process
  */
 static void __page_set_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address)
+	struct vm_area_struct *vma, unsigned long address, int exclusive)
 {
 	struct anon_vma_chain *avc;
 	struct anon_vma *anon_vma;
 
 	BUG_ON(!vma->anon_vma);
 
-	/*
-	 * We must use the _oldest_ possible anon_vma for the page mapping!
-	 *
-	 * So take the last AVC chain entry in the vma, which is the deepest
-	 * ancestor, and use the anon_vma from that.
-	 */
-	avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
-	anon_vma = avc->anon_vma;
+	if (exclusive)
+		anon_vma = vma->anon_vma;
+	else {
+		/*
+		 * The page may be shared between multiple processes.
+		 * We must use the _oldest_ possible anon_vma for the
+		 * page mapping!  That anon_vma is guaranteed to be
+		 * present in all processes that could share this page.
+		 *
+		 * So take the last AVC chain entry in the vma, which is the
+		 * deepest ancestor, and use the anon_vma from that.
+		 */
+		avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
+		anon_vma = avc->anon_vma;
+	}
 
 	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
 	page->mapping = (struct address_space *) anon_vma;
@@ -802,7 +810,7 @@ void page_add_anon_rmap(struct page *page,
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
 	if (first)
-		__page_set_anon_rmap(page, vma, address);
+		__page_set_anon_rmap(page, vma, address, 0);
 	else
 		__page_check_anon_rmap(page, vma, address);
 }
@@ -824,7 +832,7 @@ void page_add_new_anon_rmap(struct page *page,
 	SetPageSwapBacked(page);
 	atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
 	__inc_zone_page_state(page, NR_ANON_PAGES);
-	__page_set_anon_rmap(page, vma, address);
+	__page_set_anon_rmap(page, vma, address, 1);
 	if (page_evictable(page, vma))
 		lru_cache_add_lru(page, LRU_ACTIVE_ANON);
 	else



* Re: [PATCH] rmap: add exclusively owned pages to the newest anon_vma
  2010-04-14 21:59                                                                                                                                                         ` [PATCH] rmap: add exclusively owned pages to the newest anon_vma Rik van Riel
@ 2010-04-14 23:20                                                                                                                                                           ` Johannes Weiner
  2010-04-15  8:34                                                                                                                                                           ` Borislav Petkov
                                                                                                                                                                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 242+ messages in thread
From: Johannes Weiner @ 2010-04-14 23:20 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Borislav Petkov, Linus Torvalds, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On Wed, Apr 14, 2010 at 05:59:28PM -0400, Rik van Riel wrote:
> The recent anon_vma fixes cause many anonymous pages to end up
> in the parent process anon_vma, even when the page is exclusively
> owned by the current process.
> 
> Adding exclusively owned anonymous pages to the top anon_vma
> reduces rmap scanning overhead, especially in workloads with
> forking servers.
> 
> This patch adds a parameter to __page_set_anon_rmap that can
> be used to indicate whether or not the added page is exclusively
> owned by the current process.
> 
> Pages added through page_add_new_anon_rmap are exclusively
> owned by the current process, and can be added to the top
> anon_vma.
> 
> Pages added through page_add_anon_rmap can be either shared
> or exclusively owned, so we do the conservative thing and
> add it to the oldest anon_vma.
> 
> A next step would be to add the exclusive parameter to
> page_add_anon_rmap, to be used from functions where we do
> know for sure whether a page is exclusively owned.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>

Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [PATCH] rmap: add exclusively owned pages to the newest anon_vma
  2010-04-14 21:59                                                                                                                                                         ` [PATCH] rmap: add exclusively owned pages to the newest anon_vma Rik van Riel
  2010-04-14 23:20                                                                                                                                                           ` Johannes Weiner
@ 2010-04-15  8:34                                                                                                                                                           ` Borislav Petkov
  2010-04-15 16:02                                                                                                                                                           ` Minchan Kim
  2010-04-15 20:01                                                                                                                                                           ` Linus Torvalds
  3 siblings, 0 replies; 242+ messages in thread
From: Borislav Petkov @ 2010-04-15  8:34 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

From: Rik van Riel <riel@redhat.com>
Date: Wed, Apr 14, 2010 at 05:59:28PM -0400

> The recent anon_vma fixes cause many anonymous pages to end up
> in the parent process anon_vma, even when the page is exclusively
> owned by the current process.
> 
> Adding exclusively owned anonymous pages to the top anon_vma
> reduces rmap scanning overhead, especially in workloads with
> forking servers.
> 
> This patch adds a parameter to __page_set_anon_rmap that can
> be used to indicate whether or not the added page is exclusively
> owned by the current process.
> 
> Pages added through page_add_new_anon_rmap are exclusively
> owned by the current process, and can be added to the top
> anon_vma.
> 
> Pages added through page_add_anon_rmap can be either shared
> or exclusively owned, so we do the conservative thing and
> add it to the oldest anon_vma.
> 
> A next step would be to add the exclusive parameter to
> page_add_anon_rmap, to be used from functions where we do
> know for sure whether a page is exclusively owned.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
> Borislav, I audited the code before making this change, but would
> still appreciate your testing of this patch :)

Just did some light hammering and it looks ok so far. I'll keep watching
out for oopsies/issues.

Lightly-tested-by: Borislav Petkov <bp@alien8.de>

-- 
Regards/Gruss,
    Boris.


* Re: [PATCH] rmap: add exclusively owned pages to the newest anon_vma
  2010-04-14 21:59                                                                                                                                                         ` [PATCH] rmap: add exclusively owned pages to the newest anon_vma Rik van Riel
  2010-04-14 23:20                                                                                                                                                           ` Johannes Weiner
  2010-04-15  8:34                                                                                                                                                           ` Borislav Petkov
@ 2010-04-15 16:02                                                                                                                                                           ` Minchan Kim
  2010-04-15 20:01                                                                                                                                                           ` Linus Torvalds
  3 siblings, 0 replies; 242+ messages in thread
From: Minchan Kim @ 2010-04-15 16:02 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Borislav Petkov, Linus Torvalds, Johannes Weiner,
	KOSAKI Motohiro, Andrew Morton, Linux Kernel Mailing List,
	Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins,
	sgunderson

On Thu, Apr 15, 2010 at 6:59 AM, Rik van Riel <riel@redhat.com> wrote:
> The recent anon_vma fixes cause many anonymous pages to end up
> in the parent process anon_vma, even when the page is exclusively
> owned by the current process.
>
> Adding exclusively owned anonymous pages to the top anon_vma
> reduces rmap scanning overhead, especially in workloads with
> forking servers.
>
> This patch adds a parameter to __page_set_anon_rmap that can
> be used to indicate whether or not the added page is exclusively
> owned by the current process.
>
> Pages added through page_add_new_anon_rmap are exclusively
> owned by the current process, and can be added to the top
> anon_vma.
>
> Pages added through page_add_anon_rmap can be either shared
> or exclusively owned, so we do the conservative thing and
> add it to the oldest anon_vma.
>
> A next step would be to add the exclusive parameter to
> page_add_anon_rmap, to be used from functions where we do
> know for sure whether a page is exclusively owned.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>


-- 
Kind regards,
Minchan Kim


* Re: [PATCH] rmap: add exclusively owned pages to the newest anon_vma
  2010-04-14 21:59                                                                                                                                                         ` [PATCH] rmap: add exclusively owned pages to the newest anon_vma Rik van Riel
                                                                                                                                                                             ` (2 preceding siblings ...)
  2010-04-15 16:02                                                                                                                                                           ` Minchan Kim
@ 2010-04-15 20:01                                                                                                                                                           ` Linus Torvalds
  2010-04-16  6:09                                                                                                                                                             ` Felipe Balbi
  3 siblings, 1 reply; 242+ messages in thread
From: Linus Torvalds @ 2010-04-15 20:01 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
	Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson



On Wed, 14 Apr 2010, Rik van Riel wrote:
> -	/*
> -	 * We must use the _oldest_ possible anon_vma for the page mapping!
> -	 *
> -	 * So take the last AVC chain entry in the vma, which is the deepest
> -	 * ancestor, and use the anon_vma from that.
> -	 */
> -	avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
> -	anon_vma = avc->anon_vma;
> +	if (exclusive)
> +		anon_vma = vma->anon_vma;
> +	else {
> +		/*
> +		 * The page may be shared between multiple processes.
> +		 * We must use the _oldest_ possible anon_vma for the
> +		 * page mapping!  That anon_vma is guaranteed to be
> +		 * present in all processes that could share this page.
> +		 *
> +		 * So take the last AVC chain entry in the vma, which is the
> +		 * deepest ancestor, and use the anon_vma from that.
> +		 */
> +		avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
> +		anon_vma = avc->anon_vma;
> +	}

I really dislike your coding style.

If we do this conditionally, we're _much_ better off declaring the 
variables we only use inside that conditional block inside the block 
itself. And since we access "vma->anon_vma" in either case, just move that 
case outside the conditional statement, and avoid a pointless 
if/then/else.

IOW, something like this. Totally untested.

		Linus

---
 mm/rmap.c |   26 +++++++++++++++-----------
 1 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 4bad326..78d4730 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -732,21 +732,25 @@ void page_move_anon_rmap(struct page *page,
  * @address:	the user virtual address mapped
  */
 static void __page_set_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address)
+	struct vm_area_struct *vma, unsigned long address, int exclusive)
 {
-	struct anon_vma_chain *avc;
-	struct anon_vma *anon_vma;
+	struct anon_vma *anon_vma = vma->anon_vma;
 
-	BUG_ON(!vma->anon_vma);
+	BUG_ON(!anon_vma);
 
 	/*
-	 * We must use the _oldest_ possible anon_vma for the page mapping!
+	 * If the page isn't exclusively mapped into this vma,
+	 * we must use the _oldest_ possible anon_vma for the
+	 * page mapping!
 	 *
-	 * So take the last AVC chain entry in the vma, which is the deepest
-	 * ancestor, and use the anon_vma from that.
+	 * So take the last AVC chain entry in the vma, which is
+	 * the deepest ancestor, and use the anon_vma from that.
 	 */
-	avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
-	anon_vma = avc->anon_vma;
+	if (!exclusive) {
+		struct anon_vma_chain *avc;
+		avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
+		anon_vma = avc->anon_vma;
+	}
 
 	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
 	page->mapping = (struct address_space *) anon_vma;
@@ -802,7 +806,7 @@ void page_add_anon_rmap(struct page *page,
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
 	if (first)
-		__page_set_anon_rmap(page, vma, address);
+		__page_set_anon_rmap(page, vma, address, 0);
 	else
 		__page_check_anon_rmap(page, vma, address);
 }
@@ -824,7 +828,7 @@ void page_add_new_anon_rmap(struct page *page,
 	SetPageSwapBacked(page);
 	atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
 	__inc_zone_page_state(page, NR_ANON_PAGES);
-	__page_set_anon_rmap(page, vma, address);
+	__page_set_anon_rmap(page, vma, address, 1);
 	if (page_evictable(page, vma))
 		lru_cache_add_lru(page, LRU_ACTIVE_ANON);
 	else


* Re: [PATCH] rmap: add exclusively owned pages to the newest anon_vma
  2010-04-15 20:01                                                                                                                                                           ` Linus Torvalds
@ 2010-04-16  6:09                                                                                                                                                             ` Felipe Balbi
  2010-04-16 14:48                                                                                                                                                               ` Linus Torvalds
  0 siblings, 1 reply; 242+ messages in thread
From: Felipe Balbi @ 2010-04-16  6:09 UTC (permalink / raw)
  To: ext Linus Torvalds
  Cc: Rik van Riel, Borislav Petkov, Johannes Weiner, KOSAKI Motohiro,
	Andrew Morton, Minchan Kim, Linux Kernel Mailing List,
	Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins,
	sgunderson

Hi,

On Thu, Apr 15, 2010 at 10:01:11PM +0200, ext Linus Torvalds wrote:
>+		avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);

While at it, would it make sense to first provide list_last_entry(),
since we already have list_first_entry()?

Totally unrelated to this patch, sorry.

-- 
balbi


* Re: [PATCH] rmap: add exclusively owned pages to the newest anon_vma
  2010-04-16  6:09                                                                                                                                                             ` Felipe Balbi
@ 2010-04-16 14:48                                                                                                                                                               ` Linus Torvalds
  0 siblings, 0 replies; 242+ messages in thread
From: Linus Torvalds @ 2010-04-16 14:48 UTC (permalink / raw)
  To: Felipe Balbi
  Cc: Rik van Riel, Borislav Petkov, Johannes Weiner, KOSAKI Motohiro,
	Andrew Morton, Minchan Kim, Linux Kernel Mailing List,
	Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins,
	sgunderson



On Fri, 16 Apr 2010, Felipe Balbi wrote:
> 
> while at that, would it make sense to first provide list_last_entry() since we
> already have list_first_entry() ??

Yeah, it probably would make sense. Especially as doing a simple grep for 
'list_entry.*prev' does seem to imply that there might be quite a few 
places that would be able to use it. Although some of them do seem to be 
about finding the previous entry rather than the last in a list.

That said, doing the same grep for 'next' shows that a lot of places don't 
use the list_first_entry() that we _do_ have, so..

		Linus
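
As an illustration of the helper being discussed, here is a self-contained
userspace sketch. The list_head layout and the list_entry() and
list_first_entry() macros mirror the well-known kernel idiom, but everything
here is re-created for the example; list_init(), list_add_tail() as written
below and struct item are assumptions made for the demo, not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace re-creation of a circular doubly linked list head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Recover the containing structure from a pointer to its list member. */
#define list_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define list_first_entry(ptr, type, member) \
	list_entry((ptr)->next, type, member)

/* The proposed helper: the tail of a circular list is head->prev. */
#define list_last_entry(ptr, type, member) \
	list_entry((ptr)->prev, type, member)

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct list_head *node, struct list_head *head)
{
	assert(node != NULL && head != NULL);
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Hypothetical payload type used only for this demonstration. */
struct item {
	int value;
	struct list_head link;
};
```

Both accessors are only meaningful on a non-empty list; on an empty one
they would hand back a pointer computed from the head itself.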


end of thread, other threads:[~2010-04-16 14:53 UTC | newest]

Thread overview: 242+ messages
2010-03-30 17:50 Linux 2.6.34-rc3 Linus Torvalds
2010-03-30 21:16 ` [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) Rafael J. Wysocki
2010-03-31 20:34   ` [stable] " Greg KH
2010-04-01  1:13   ` Rafael J. Wysocki
2010-04-01  2:19     ` Alex Deucher
2010-04-01  2:19       ` Alex Deucher
2010-04-01  6:36       ` Clemens Ladisch
2010-04-01 15:01         ` Alex Deucher
2010-04-01 15:01           ` Alex Deucher
2010-04-01 20:28           ` Rafael J. Wysocki
2010-04-01 20:28             ` Rafael J. Wysocki
2010-04-01 20:39             ` Alex Deucher
2010-04-01 20:39               ` Alex Deucher
2010-04-01 20:48               ` Rafael J. Wysocki
2010-04-01 21:00                 ` Alex Deucher
2010-04-01 21:00                   ` Alex Deucher
2010-04-01 21:01                 ` Alex Deucher
2010-04-01 21:01                   ` Alex Deucher
2010-04-01 21:08                   ` Rafael J. Wysocki
2010-04-01 21:13                     ` Alex Deucher
2010-04-01 21:13                       ` Alex Deucher
2010-04-01 21:46                       ` Rafael J. Wysocki
2010-04-01 22:07                         ` Alex Deucher
2010-04-01 22:07                           ` Alex Deucher
2010-04-01 23:20                           ` Rafael J. Wysocki
2010-04-02  0:23                             ` Linus Torvalds
2010-04-02 16:46                               ` Rafael J. Wysocki
2010-04-03 18:08                                 ` Clemens Ladisch
2010-04-03 19:33                                   ` Rafael J. Wysocki
2010-04-01 16:29     ` Linus Torvalds
2010-04-01 17:07       ` Alex Deucher
2010-04-01 17:07         ` Alex Deucher
2010-04-01 17:24         ` Linus Torvalds
2010-04-01 17:50           ` [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 Clemens Ladisch
2010-04-01 17:53           ` [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) Alex Deucher
2010-04-01 17:53             ` Alex Deucher
2010-04-01 20:17             ` Linus Torvalds
2010-04-01 20:23               ` Alex Deucher
2010-04-01 20:23                 ` Alex Deucher
2010-04-01 19:46       ` Rafael J. Wysocki
2010-04-01 22:48       ` Jesse Barnes
2010-04-01 23:23         ` Rafael J. Wysocki
2010-04-02 17:59 ` Ugly rmap NULL ptr deref oopsie on hibernate (was " Borislav Petkov
2010-04-02 18:09   ` Linus Torvalds
2010-04-02 15:24     ` Andrew Morton
2010-04-02 18:37       ` Linus Torvalds
2010-04-02 22:01         ` Rik van Riel
2010-04-03  0:19           ` Linus Torvalds
2010-04-04 16:12           ` Minchan Kim
2010-04-04 17:24             ` Rik van Riel
2010-04-04 23:09             ` [PATCH] rmap: fix anon_vma_fork() memory leak Rik van Riel
2010-04-04 23:56               ` Minchan Kim
2010-04-05 15:37               ` Linus Torvalds
2010-04-05 15:48                 ` Minchan Kim
2010-04-05 16:04                 ` Rik van Riel
2010-04-05 16:13                 ` [PATCH -v2] " Rik van Riel
2010-04-06  8:53     ` Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) KOSAKI Motohiro
2010-04-06 10:09       ` KOSAKI Motohiro
2010-04-06 14:34         ` Rik van Riel
2010-04-06 14:38       ` Rik van Riel
2010-04-06 15:34         ` Minchan Kim
2010-04-06 15:40           ` Rik van Riel
2010-04-06 15:58             ` Minchan Kim
2010-04-06 15:55           ` Linus Torvalds
2010-04-06 16:23             ` Minchan Kim
2010-04-06 16:28               ` Linus Torvalds
2010-04-06 16:45                 ` Minchan Kim
2010-04-06 16:53                   ` Linus Torvalds
2010-04-06 17:04                     ` Rik van Riel
2010-04-06 18:28                       ` Linus Torvalds
2010-04-06 19:03                         ` Andrew Morton
2010-04-06 19:10                           ` Steinar H. Gunderson
2010-04-06 19:10                           ` Linus Torvalds
2010-04-06 19:35                             ` Linus Torvalds
2010-04-06 19:42                           ` Borislav Petkov
2010-04-06 20:02                             ` Linus Torvalds
2010-04-06 20:46                               ` Steinar H. Gunderson
2010-04-06 20:56                                 ` Linus Torvalds
2010-04-06 21:05                                   ` Steinar H. Gunderson
2010-04-06 20:51                               ` Borislav Petkov
2010-04-06 21:27                                 ` Linus Torvalds
2010-04-06 22:59                                   ` Borislav Petkov
2010-04-06 23:27                                     ` Linus Torvalds
2010-04-06 23:54                                       ` [PATCH] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA Rik van Riel
2010-04-07  7:00                                         ` KOSAKI Motohiro
2010-04-07 14:48                                           ` Rik van Riel
2010-04-07 14:54                                           ` [PATCH -v2] " Rik van Riel
2010-04-07 15:30                                             ` Linus Torvalds
2010-04-07 15:52                                               ` Rik van Riel
2010-04-07 16:56                                                 ` Linus Torvalds
2010-04-07 21:19                                                   ` Linus Torvalds
2010-04-07 21:52                                                     ` Rik van Riel
2010-04-07 22:09                                                       ` Linus Torvalds
2010-04-07 22:15                                                         ` Linus Torvalds
2010-04-08  0:38                                                           ` Rik van Riel
2010-04-07 23:37                                                         ` Linus Torvalds
2010-04-08  2:03                                                           ` KOSAKI Motohiro
2010-04-08  2:33                                                             ` Linus Torvalds
2010-04-08  5:47                                                               ` Borislav Petkov
2010-04-08 14:11                                                                 ` Linus Torvalds
2010-04-08 18:25                                                                   ` Rik van Riel
2010-04-08 18:32                                                                     ` Linus Torvalds
2010-04-08 20:31                                                                       ` Borislav Petkov
2010-04-08 21:00                                                                   ` Borislav Petkov
2010-04-08 23:16                                                                     ` Linus Torvalds
2010-04-08 23:47                                                                       ` Borislav Petkov
2010-04-09  0:50                                                                         ` Linus Torvalds
2010-04-09  1:30                                                                           ` Borislav Petkov
2010-04-09  9:21                                                                             ` Borislav Petkov
2010-04-09 16:35                                                                               ` Linus Torvalds
2010-04-09 17:40                                                                                 ` Borislav Petkov
2010-04-09 17:50                                                                                   ` Linus Torvalds
2010-04-09 19:14                                                                                     ` Borislav Petkov
2010-04-09 19:32                                                                                       ` Linus Torvalds
2010-04-09 20:03                                                                                         ` Rik van Riel
2010-04-09 20:43                                                                                         ` Johannes Weiner
2010-04-09 20:57                                                                                           ` Rik van Riel
2010-04-09 21:33                                                                                           ` Borislav Petkov
2010-04-09 23:22                                                                                           ` Linus Torvalds
2010-04-09 23:45                                                                                             ` Rik van Riel
2010-04-10  0:03                                                                                               ` Linus Torvalds
2010-04-10  0:11                                                                                                 ` Rik van Riel
2010-04-09 23:54                                                                                             ` Johannes Weiner
2010-04-09 23:56                                                                                             ` Linus Torvalds
2010-04-10  0:19                                                                                               ` Rik van Riel
2010-04-10  0:31                                                                                               ` Johannes Weiner
2010-04-10  0:32                                                                                                 ` Linus Torvalds
2010-04-10  7:27                                                                                                   ` Borislav Petkov
2010-04-10 11:26                                                                                                     ` Borislav Petkov
2010-04-10 14:45                                                                                                       ` Rik van Riel
2010-04-10 15:24                                                                                                       ` Linus Torvalds
2010-04-10 16:38                                                                                                         ` Borislav Petkov
2010-04-10 17:05                                                                                                           ` Linus Torvalds
2010-04-10 18:21                                                                                                             ` Linus Torvalds
2010-04-10 18:26                                                                                                               ` Linus Torvalds
2010-04-10 18:51                                                                                                               ` Borislav Petkov
2010-04-10 18:58                                                                                                                 ` Borislav Petkov
2010-04-10 20:05                                                                                                                   ` Linus Torvalds
2010-04-10 20:12                                                                                                                     ` Linus Torvalds
2010-04-10 20:36                                                                                                                       ` Borislav Petkov
2010-04-10 20:40                                                                                                                         ` Linus Torvalds
2010-04-10 21:25                                                                                                                           ` Borislav Petkov
2010-04-10 21:30                                                                                                                             ` Linus Torvalds
2010-04-10 21:51                                                                                                                               ` Borislav Petkov
2010-04-11 13:08                                                                                                                                 ` Borislav Petkov
2010-04-11 13:19                                                                                                                                   ` [PATCH 1/3] mm: make page freeing path RCU-safe Borislav Petkov
2010-04-11 13:19                                                                                                                                   ` [PATCH 2/3] mm: cleanup find_mergeable_anon_vma complexity Borislav Petkov
2010-04-11 13:19                                                                                                                                   ` [PATCH 3/3] mm: fixup vma_adjust Borislav Petkov
2010-04-11 13:25                                                                                                                                   ` [PATCH 2/3] mm: cleanup find_mergeable_anon_vma complexity Borislav Petkov
2010-04-11 17:07                                                                                                                                   ` [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA Linus Torvalds
2010-04-11 17:16                                                                                                                                     ` Linus Torvalds
2010-04-11 18:55                                                                                                                                       ` Borislav Petkov
2010-04-12  0:13                                                                                                                                         ` Linus Torvalds
2010-04-12  1:04                                                                                                                                           ` Linus Torvalds
2010-04-12  7:20                                                                                                                                             ` Borislav Petkov
2010-04-12 16:02                                                                                                                                               ` Linus Torvalds
2010-04-12 16:26                                                                                                                                                 ` Linus Torvalds
2010-04-12 18:40                                                                                                                                                   ` Rik van Riel
2010-04-12 19:00                                                                                                                                                     ` Borislav Petkov
2010-04-12 19:17                                                                                                                                                       ` Linus Torvalds
2010-04-12 20:22                                                                                                                                                         ` [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare() Linus Torvalds
2010-04-12 20:23                                                                                                                                                           ` [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains Linus Torvalds
2010-04-12 20:23                                                                                                                                                             ` [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order Linus Torvalds
2010-04-12 20:23                                                                                                                                                               ` [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma Linus Torvalds
2010-04-12 21:03                                                                                                                                                                 ` Rik van Riel
2010-04-13  0:41                                                                                                                                                                 ` Johannes Weiner
2010-04-13  1:08                                                                                                                                                                   ` Linus Torvalds
2010-04-13  4:23                                                                                                                                                                     ` Minchan Kim
2010-04-13  4:26                                                                                                                                                                       ` Minchan Kim
2010-04-12 20:57                                                                                                                                                               ` [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order Rik van Riel
2010-04-13  0:18                                                                                                                                                               ` Johannes Weiner
2010-04-13  4:16                                                                                                                                                               ` Minchan Kim
2010-04-12 20:54                                                                                                                                                             ` [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains Rik van Riel
2010-04-12 23:59                                                                                                                                                             ` Johannes Weiner
2010-04-13  4:15                                                                                                                                                             ` Minchan Kim
2010-04-12 20:54                                                                                                                                                           ` [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare() Rik van Riel
2010-04-12 23:54                                                                                                                                                           ` Johannes Weiner
2010-04-13  4:04                                                                                                                                                           ` Minchan Kim
2010-04-13  9:51                                                                                                                                                           ` Peter Zijlstra
2010-04-12 21:50                                                                                                                                                   ` [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA Borislav Petkov
2010-04-12 22:11                                                                                                                                                     ` Linus Torvalds
2010-04-12 22:18                                                                                                                                                       ` Linus Torvalds
2010-04-12 22:29                                                                                                                                                         ` Borislav Petkov
2010-04-13  9:38                                                                                                                                                       ` Borislav Petkov
2010-04-14 21:59                                                                                                                                                         ` [PATCH] rmap: add exclusively owned pages to the newest anon_vma Rik van Riel
2010-04-14 23:20                                                                                                                                                           ` Johannes Weiner
2010-04-15  8:34                                                                                                                                                           ` Borislav Petkov
2010-04-15 16:02                                                                                                                                                           ` Minchan Kim
2010-04-15 20:01                                                                                                                                                           ` Linus Torvalds
2010-04-16  6:09                                                                                                                                                             ` Felipe Balbi
2010-04-16 14:48                                                                                                                                                               ` Linus Torvalds
2010-04-11 19:49                                                                                                                                       ` [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA Rik van Riel
2010-04-12 15:44                                                                                                                                         ` Linus Torvalds
2010-04-12 15:51                                                                                                                                           ` Rik van Riel
2010-04-11 21:45                                                                                                                                       ` Rik van Riel
2010-04-12 15:51                                                                                                                                         ` Linus Torvalds
2010-04-13 10:36                                                                                                                                           ` KOSAKI Motohiro
2010-04-10 20:24                                                                                                                     ` Rik van Riel
2010-04-10 20:34                                                                                                                       ` Linus Torvalds
2010-04-10 20:43                                                                                                                         ` Rik van Riel
2010-04-10 20:32                                                                                                                     ` Rik van Riel
2010-04-10 19:36                                                                                                               ` Rik van Riel
2010-04-12 14:40                                                                                                               ` Peter Zijlstra
2010-04-12 15:17                                                                                                                 ` Minchan Kim
2010-04-12 15:33                                                                                                                   ` Peter Zijlstra
2010-04-12 15:19                                                                                                                 ` Rik van Riel
2010-04-12 16:01                                                                                                                   ` Peter Zijlstra
2010-04-12 16:06                                                                                                                     ` Rik van Riel
2010-04-12 16:46                                                                                                                       ` Linus Torvalds
2010-04-12 18:40                                                                                                                         ` Peter Zijlstra
2010-04-12 19:30                                                                                                                           ` Peter Zijlstra
2010-04-12 19:44                                                                                                                             ` Peter Zijlstra
2010-04-13 10:53                                                                                                                     ` KOSAKI Motohiro
2010-04-13 11:30                                                                                                                       ` Peter Zijlstra
2010-04-13 12:00                                                                                                                         ` KOSAKI Motohiro
2010-04-14 14:27                                                                                                                           ` Peter Zijlstra
2010-04-10 17:07                                                                                                           ` Borislav Petkov
2010-04-10 16:41                                                                                                         ` Linus Torvalds
2010-04-10 22:49                                                                                                           ` Johannes Weiner
2010-04-10 23:31                                                                                                             ` Linus Torvalds
2010-04-09  1:45                                                                           ` KOSAKI Motohiro
2010-04-07 15:55                                             ` Minchan Kim
2010-04-07  7:29                                       ` Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) Borislav Petkov
2010-04-07 14:05                                       ` Paulo Marques
2010-04-07 14:13                                         ` Borislav Petkov
2010-04-06 23:37                                     ` Linus Torvalds
2010-04-06 23:22                                   ` Rik van Riel
2010-04-07  0:10                                     ` Linus Torvalds
2010-04-07  1:18                                       ` Rik van Riel
2010-04-07  7:22                                         ` Borislav Petkov
2010-04-07 10:09                                       ` Pekka Enberg
2010-04-07 10:12                                         ` KOSAKI Motohiro
2010-04-07  8:41                               ` Peter Zijlstra
2010-04-07  8:36                         ` Peter Zijlstra
2010-04-07  9:16                           ` Johannes Weiner
2010-04-07  9:37                             ` Peter Zijlstra
2010-04-07 14:12                           ` Rik van Riel
2010-04-07 15:46                           ` Linus Torvalds
2010-04-06 16:32               ` Linus Torvalds
2010-04-06 16:54                 ` Minchan Kim
2010-04-07  8:37             ` Peter Zijlstra
2010-04-06 17:05         ` Borislav Petkov
