* [PATCH 3.4 00/24] 3.4.81-stable review
From: Greg Kroah-Hartman @ 2014-02-18 22:46 UTC
  To: linux-kernel; +Cc: Greg Kroah-Hartman, torvalds, akpm, stable, lizefan

Many thanks to Li Zefan for digging up a bunch of these patches; that
work is much appreciated.

This is the start of the stable review cycle for the 3.4.81 release.
There are 24 patches in this series, all of which will be posted as
responses to this one.  If anyone has any issues with these being
applied, please let me know.

Responses should be made by Thu Feb 20 22:45:38 UTC 2014.
Anything received after that time might be too late.

The whole patch series can be found in one patch at:
	kernel.org/pub/linux/kernel/v3.0/stable-review/patch-3.4.81-rc1.gz
and the diffstat can be found below.

thanks,

greg k-h

-------------
Pseudo-Shortlog of commits:

Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Linux 3.4.81-rc1

Jeff Layton <jlayton@redhat.com>
    nfs: tear down caches in nfs_init_writepagecache when allocation fails

Dan Rosenberg <drosenberg@vsecurity.com>
    lib/vsprintf.c: kptr_restrict: fix pK-error in SysRq show-all-timers(Q)

Asias He <asias@redhat.com>
    virtio-blk: Use block layer provided spinlock

Seth Forshee <seth.forshee@canonical.com>
    Input: synaptics - handle out of bounds values from the hardware

Bojan Smojver <bojan@rexursive.com>
    PM / Hibernate: Hibernate/thaw fixes/improvements

Avi Kivity <avi@redhat.com>
    KVM: Fix buffer overflow in kvm_set_irq()

Nicholas Bellinger <nab@linux-iscsi.org>
    target/file: Re-enable optional fd_buffered_io=1 operation

Nicholas Bellinger <nab@linux-iscsi.org>
    target/file: Use O_DSYNC by default for FILEIO backends

Jan Kara <jack@suse.cz>
    IB/qib: Convert qib_user_sdma_pin_pages() to use get_user_pages_fast()

Peter Zijlstra <a.p.zijlstra@chello.nl>
    sched/nohz: Fix rq->cpu_load calculations some more

Peter Zijlstra <a.p.zijlstra@chello.nl>
    sched/nohz: Fix rq->cpu_load[] calculations

Steven Rostedt <rostedt@goodmis.org>
    ftrace: Have function graph only trace based on global_ops filters

Steven Rostedt <rostedt@goodmis.org>
    ftrace: Fix synchronization location disabling and freeing ftrace_ops

Steven Rostedt <rostedt@goodmis.org>
    ftrace: Synchronize setting function_trace_op with ftrace_trace_function

Mikulas Patocka <mpatocka@redhat.com>
    dm sysfs: fix a module unload race

Xishi Qiu <qiuxishi@huawei.com>
    mm: setup pageblock_order before it's used by sparsemem

Andrew Morton <akpm@linux-foundation.org>
    mm/page_alloc.c: remove pageblock_default_order()

Daniel Vetter <daniel.vetter@ffwll.ch>
    drm/i915: kick any firmware framebuffers before claiming the gtt

Tao Ma <boyu.mt@taobao.com>
    ext4: protect group inode free counting with group lock

Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    printk: Fix scheduling-while-atomic problem in console_cpu_notify()

Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
    x86, hweight: Fix BUG when booting with CONFIG_GCOV_PROFILE_ALL=y

KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    mm: __set_page_dirty uses spin_lock_irqsave instead of spin_lock_irq

KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    mm: __set_page_dirty_nobuffers() uses spin_lock_irqsave() instead of spin_lock_irq()

Stephen Smalley <sds@tycho.nsa.gov>
    SELinux: Fix kernel BUG on empty security contexts.


-------------

Diffstat:

 Makefile                                  |   4 +-
 drivers/block/virtio_blk.c                |   9 +--
 drivers/gpu/drm/i915/i915_dma.c           |  37 ++++++++--
 drivers/infiniband/hw/qib/qib_user_sdma.c |   6 +-
 drivers/input/mouse/synaptics.c           |  23 ++++++
 drivers/md/Kconfig                        |   4 ++
 drivers/md/Makefile                       |   1 +
 drivers/md/dm-builtin.c                   |  50 +++++++++++++
 drivers/md/dm-sysfs.c                     |   5 --
 drivers/md/dm.c                           |  26 ++-----
 drivers/md/dm.h                           |  17 ++++-
 drivers/target/target_core_file.c         |  81 ++++++++++-----------
 drivers/target/target_core_file.h         |   2 +-
 fs/buffer.c                               |   6 +-
 fs/ext4/ialloc.c                          |   4 +-
 fs/nfs/write.c                            |  10 ++-
 include/linux/sched.h                     |   1 +
 kernel/power/swap.c                       |  62 +++++++++++------
 kernel/printk.c                           |   1 -
 kernel/sched/core.c                       |  86 +++++++++++++++++++----
 kernel/sched/fair.c                       |   2 +-
 kernel/sched/sched.h                      |   2 +-
 kernel/time/tick-sched.c                  |   1 +
 kernel/trace/ftrace.c                     | 112 +++++++++++++++++++++++++-----
 lib/Makefile                              |   1 +
 lib/vsprintf.c                            |   3 +-
 mm/internal.h                             |   2 +
 mm/page-writeback.c                       |   5 +-
 mm/page_alloc.c                           |  33 ++++-----
 mm/sparse.c                               |   3 +
 security/selinux/ss/services.c            |   4 ++
 virt/kvm/irq_comm.c                       |   1 +
 32 files changed, 426 insertions(+), 178 deletions(-)




* [PATCH 3.4 01/24] SELinux:  Fix kernel BUG on empty security contexts.
From: Greg Kroah-Hartman @ 2014-02-18 22:46 UTC
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Matthew Thode, Stephen Smalley, Paul Moore

3.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Stephen Smalley <sds@tycho.nsa.gov>

commit 2172fa709ab32ca60e86179dc67d0857be8e2c98 upstream.

Setting an empty security context (length=0) on a file will
lead to incorrectly dereferencing the type and other fields
of the security context structure, yielding a kernel BUG.
As a zero-length security context is never valid, just reject
all such security contexts whether coming from userspace
via setxattr or coming from the filesystem upon a getxattr
request by SELinux.

Setting a security context value (empty or otherwise) unknown to
SELinux in the first place is only possible for a root process
(CAP_MAC_ADMIN), and, if running SELinux in enforcing mode, only
if the corresponding SELinux mac_admin permission is also granted
to the domain by policy.  In Fedora policies, this is only allowed for
specific domains such as livecd for setting down security contexts
that are not defined in the build host policy.

Reproducer:
su
setenforce 0
touch foo
setfattr -n security.selinux foo
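
The same reproducer as a minimal C program might look like this (a
sketch; it simply calls setxattr(2) directly with a zero-length value,
and needs the same privileges described above):

	#include <stdio.h>
	#include <sys/xattr.h>

	int main(void)
	{
		/* Write a zero-length security.selinux value on "foo".
		 * Before this fix, any later access that computes the
		 * file's security context hits the BUG below. */
		if (setxattr("foo", "security.selinux", "", 0, 0) != 0)
			perror("setxattr");
		return 0;
	}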

Caveat:
Relabeling or removing foo after doing the above may not be possible
without booting with SELinux disabled.  Any subsequent access to foo
after doing the above will also trigger the BUG.

BUG output from Matthew Thode:
[  473.893141] ------------[ cut here ]------------
[  473.962110] kernel BUG at security/selinux/ss/services.c:654!
[  473.995314] invalid opcode: 0000 [#6] SMP
[  474.027196] Modules linked in:
[  474.058118] CPU: 0 PID: 8138 Comm: ls Tainted: G      D   I  3.13.0-grsec #1
[  474.116637] Hardware name: Supermicro X8ST3/X8ST3, BIOS 2.0 07/29/10
[  474.149768] task: ffff8805f50cd010 ti: ffff8805f50cd488 task.ti: ffff8805f50cd488
[  474.183707] RIP: 0010:[<ffffffff814681c7>]  [<ffffffff814681c7>] context_struct_compute_av+0xce/0x308
[  474.219954] RSP: 0018:ffff8805c0ac3c38  EFLAGS: 00010246
[  474.252253] RAX: 0000000000000000 RBX: ffff8805c0ac3d94 RCX: 0000000000000100
[  474.287018] RDX: ffff8805e8aac000 RSI: 00000000ffffffff RDI: ffff8805e8aaa000
[  474.321199] RBP: ffff8805c0ac3cb8 R08: 0000000000000010 R09: 0000000000000006
[  474.357446] R10: 0000000000000000 R11: ffff8805c567a000 R12: 0000000000000006
[  474.419191] R13: ffff8805c2b74e88 R14: 00000000000001da R15: 0000000000000000
[  474.453816] FS:  00007f2e75220800(0000) GS:ffff88061fc00000(0000) knlGS:0000000000000000
[  474.489254] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  474.522215] CR2: 00007f2e74716090 CR3: 00000005c085e000 CR4: 00000000000207f0
[  474.556058] Stack:
[  474.584325]  ffff8805c0ac3c98 ffffffff811b549b ffff8805c0ac3c98 ffff8805f1190a40
[  474.618913]  ffff8805a6202f08 ffff8805c2b74e88 00068800d0464990 ffff8805e8aac860
[  474.653955]  ffff8805c0ac3cb8 000700068113833a ffff880606c75060 ffff8805c0ac3d94
[  474.690461] Call Trace:
[  474.723779]  [<ffffffff811b549b>] ? lookup_fast+0x1cd/0x22a
[  474.778049]  [<ffffffff81468824>] security_compute_av+0xf4/0x20b
[  474.811398]  [<ffffffff8196f419>] avc_compute_av+0x2a/0x179
[  474.843813]  [<ffffffff8145727b>] avc_has_perm+0x45/0xf4
[  474.875694]  [<ffffffff81457d0e>] inode_has_perm+0x2a/0x31
[  474.907370]  [<ffffffff81457e76>] selinux_inode_getattr+0x3c/0x3e
[  474.938726]  [<ffffffff81455cf6>] security_inode_getattr+0x1b/0x22
[  474.970036]  [<ffffffff811b057d>] vfs_getattr+0x19/0x2d
[  475.000618]  [<ffffffff811b05e5>] vfs_fstatat+0x54/0x91
[  475.030402]  [<ffffffff811b063b>] vfs_lstat+0x19/0x1b
[  475.061097]  [<ffffffff811b077e>] SyS_newlstat+0x15/0x30
[  475.094595]  [<ffffffff8113c5c1>] ? __audit_syscall_entry+0xa1/0xc3
[  475.148405]  [<ffffffff8197791e>] system_call_fastpath+0x16/0x1b
[  475.179201] Code: 00 48 85 c0 48 89 45 b8 75 02 0f 0b 48 8b 45 a0 48 8b 3d 45 d0 b6 00 8b 40 08 89 c6 ff ce e8 d1 b0 06 00 48 85 c0 49 89 c7 75 02 <0f> 0b 48 8b 45 b8 4c 8b 28 eb 1e 49 8d 7d 08 be 80 01 00 00 e8
[  475.255884] RIP  [<ffffffff814681c7>] context_struct_compute_av+0xce/0x308
[  475.296120]  RSP <ffff8805c0ac3c38>
[  475.328734] ---[ end trace f076482e9d754adc ]---

Reported-by:  Matthew Thode <mthode@mthode.org>
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 security/selinux/ss/services.c |    4 ++++
 1 file changed, 4 insertions(+)

--- a/security/selinux/ss/services.c
+++ b/security/selinux/ss/services.c
@@ -1229,6 +1229,10 @@ static int security_context_to_sid_core(
 	struct context context;
 	int rc = 0;
 
+	/* An empty security context is never valid. */
+	if (!scontext_len)
+		return -EINVAL;
+
 	if (!ss_initialized) {
 		int i;
 




* [PATCH 3.4 02/24] mm: __set_page_dirty_nobuffers() uses spin_lock_irqsave() instead of spin_lock_irq()
From: Greg Kroah-Hartman @ 2014-02-18 22:46 UTC
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, KOSAKI Motohiro, Larry Woodman,
	Rik van Riel, Johannes Weiner, David Rientjes, Andrew Morton,
	Linus Torvalds

3.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

commit a85d9df1ea1d23682a0ed1e100e6965006595d06 upstream.

During an aio stress test, we observed the following lockdep warning.
It means that AIO combined with NUMA balancing is currently
deadlockable.

The problem is that aio_migratepage() disables interrupts, but
__set_page_dirty_nobuffers() unintentionally enables them again.

Generally, all helper functions should use spin_lock_irqsave() instead
of spin_lock_irq() because they know nothing about their callers'
context.
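
As a sketch of the pattern (a hypothetical helper, not code from this
patch), saving and restoring the caller's interrupt state keeps a
helper safe in any context:

	/* Hypothetical helper: safe whether or not the caller has
	 * already disabled interrupts. */
	static void helper(spinlock_t *lock)
	{
		unsigned long flags;

		spin_lock_irqsave(lock, flags);	/* save current IRQ state */
		/* ... critical section ... */
		spin_unlock_irqrestore(lock, flags); /* restore saved state */
	}

By contrast, spin_unlock_irq() re-enables interrupts unconditionally,
which is exactly what bites a caller like aio_migratepage() that still
expects them to be off.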

   other info that might help us debug this:
    Possible unsafe locking scenario:

          CPU0
          ----
     lock(&(&ctx->completion_lock)->rlock);
     <Interrupt>
       lock(&(&ctx->completion_lock)->rlock);

    *** DEADLOCK ***

      dump_stack+0x19/0x1b
      print_usage_bug+0x1f7/0x208
      mark_lock+0x21d/0x2a0
      mark_held_locks+0xb9/0x140
      trace_hardirqs_on_caller+0x105/0x1d0
      trace_hardirqs_on+0xd/0x10
      _raw_spin_unlock_irq+0x2c/0x50
      __set_page_dirty_nobuffers+0x8c/0xf0
      migrate_page_copy+0x434/0x540
      aio_migratepage+0xb1/0x140
      move_to_new_page+0x7d/0x230
      migrate_pages+0x5e5/0x700
      migrate_misplaced_page+0xbc/0xf0
      do_numa_page+0x102/0x190
      handle_pte_fault+0x241/0x970
      handle_mm_fault+0x265/0x370
      __do_page_fault+0x172/0x5a0
      do_page_fault+0x1a/0x70
      page_fault+0x28/0x30

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 mm/page-writeback.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1993,11 +1993,12 @@ int __set_page_dirty_nobuffers(struct pa
 	if (!TestSetPageDirty(page)) {
 		struct address_space *mapping = page_mapping(page);
 		struct address_space *mapping2;
+		unsigned long flags;
 
 		if (!mapping)
 			return 1;
 
-		spin_lock_irq(&mapping->tree_lock);
+		spin_lock_irqsave(&mapping->tree_lock, flags);
 		mapping2 = page_mapping(page);
 		if (mapping2) { /* Race with truncate? */
 			BUG_ON(mapping2 != mapping);
@@ -2006,7 +2007,7 @@ int __set_page_dirty_nobuffers(struct pa
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
-		spin_unlock_irq(&mapping->tree_lock);
+		spin_unlock_irqrestore(&mapping->tree_lock, flags);
 		if (mapping->host) {
 			/* !PageAnon && !swapper_space */
 			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);




* [PATCH 3.4 03/24] mm: __set_page_dirty uses spin_lock_irqsave instead of spin_lock_irq
From: Greg Kroah-Hartman @ 2014-02-18 22:46 UTC
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, KOSAKI Motohiro, Andrew Morton,
	Linus Torvalds

3.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

commit 227d53b397a32a7614667b3ecaf1d89902fb6c12 upstream.

Using spin_{un}lock_irq() is dangerous if the caller has disabled
interrupts.  During aio buffer migration, we can see the following
call stack.

aio_migratepage  [disable interrupt]
  migrate_page_copy
    clear_page_dirty_for_io
      set_page_dirty
        __set_page_dirty_buffers
          __set_page_dirty
            spin_lock_irq

This means the current aio migration path can deadlock.
spin_lock_irqsave() is a safer alternative, and we should use it.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 fs/buffer.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -613,14 +613,16 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
 static void __set_page_dirty(struct page *page,
 		struct address_space *mapping, int warn)
 {
-	spin_lock_irq(&mapping->tree_lock);
+	unsigned long flags;
+
+	spin_lock_irqsave(&mapping->tree_lock, flags);
 	if (page->mapping) {	/* Race with truncate? */
 		WARN_ON_ONCE(warn && !PageUptodate(page));
 		account_page_dirtied(page, mapping);
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 	}
-	spin_unlock_irq(&mapping->tree_lock);
+	spin_unlock_irqrestore(&mapping->tree_lock, flags);
 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 }
 




* [PATCH 3.4 04/24] x86, hweight: Fix BUG when booting with CONFIG_GCOV_PROFILE_ALL=y
From: Greg Kroah-Hartman @ 2014-02-18 22:46 UTC
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Meelis Roos, Andrew Morton,
	Peter Oberparleiter, H. Peter Anvin

3.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>

commit 6583327c4dd55acbbf2a6f25e775b28b3abf9a42 upstream.

Commit d61931d89b ("x86: Add optimized popcnt variants") introduced
the compile flag -fcall-saved-rdi for lib/hweight.c. When combined
with the options -fprofile-arcs and -O2, this flag causes gcc to
generate broken constructor code. As a result, a 64-bit x86 kernel
compiled with CONFIG_GCOV_PROFILE_ALL=y prints the message "gcov:
could not create file" and runs into sporadic BUGs during boot.

The gcc people indicate that these kinds of problems are endemic when
using ad hoc calling conventions.  It is therefore best to treat any
file compiled with ad hoc calling conventions as an isolated
environment and avoid things like profiling or coverage analysis,
since those subsystems assume "normal" calling conventions.

This patch avoids the bug by excluding lib/hweight.o from coverage
profiling.

Reported-by: Meelis Roos <mroos@linux.ee>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/52F3A30C.7050205@linux.vnet.ibm.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 lib/Makefile |    1 +
 1 file changed, 1 insertion(+)

--- a/lib/Makefile
+++ b/lib/Makefile
@@ -41,6 +41,7 @@ obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock
 lib-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
 lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
 
+GCOV_PROFILE_hweight.o := n
 CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))
 obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
 




* [PATCH 3.4 05/24] printk: Fix scheduling-while-atomic problem in console_cpu_notify()
From: Greg Kroah-Hartman @ 2014-02-18 22:46 UTC
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Paul E. McKenney, Srivatsa S. Bhat,
	Linus Torvalds, Guillaume Morin

3.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

commit 85eae82a0855d49852b87deac8653e4ebc8b291f upstream.

The console_cpu_notify() function runs with interrupts disabled in the
CPU_DYING case.  It therefore cannot block, for example, as will happen
when it calls console_lock().  Therefore, remove the CPU_DYING leg of
the switch statement to avoid this problem.
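
As a sketch of the constraint (a simplified notifier using the 3.4-era
hotplug callback API, not this file's actual code): CPU_DYING callbacks
run in atomic context and must not take sleeping locks, while the
remaining notifications run in process context:

	static int example_cpu_notify(struct notifier_block *self,
				      unsigned long action, void *hcpu)
	{
		switch (action) {
		case CPU_DYING:
			/* Runs with IRQs off (stop_machine context): no
			 * console_lock() or other sleeping calls here. */
			break;
		case CPU_ONLINE:
		case CPU_DEAD:
			/* Process context: blocking is fine. */
			console_lock();
			console_unlock();
			break;
		}
		return NOTIFY_OK;
	}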

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Guillaume Morin <guillaume@morinfr.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 kernel/printk.c |    1 -
 1 file changed, 1 deletion(-)

--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -1172,7 +1172,6 @@ static int __cpuinit console_cpu_notify(
 	switch (action) {
 	case CPU_ONLINE:
 	case CPU_DEAD:
-	case CPU_DYING:
 	case CPU_DOWN_FAILED:
 	case CPU_UP_CANCELED:
 		console_lock();




* [PATCH 3.4 06/24] ext4: protect group inode free counting with group lock
From: Greg Kroah-Hartman @ 2014-02-18 22:46 UTC
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Tao Ma, Theodore Tso, Benjamin LaHaise

3.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Tao Ma <boyu.mt@taobao.com>

commit 6f2e9f0e7d795214b9cf5a47724a273b705fd113 upstream.

Currently, when we set the group inode free count, we don't hold the
group lock, so multiple threads may decrease the inode free count at
the same time, and e2fsck will complain with something like:

Free inodes count wrong for group #1 (1, counted=0).
Fix? no

Free inodes count wrong for group #2 (3, counted=0).
Fix? no

Directories count wrong for group #2 (780, counted=779).
Fix? no

Free inodes count wrong for group #3 (2272, counted=2273).
Fix? no

So this patch protects it with ext4_lock_group().

By the way, the problem was found by xfstests test case 269 on a
volume mkfsed with the parameter
"-O ^resize_inode,^uninit_bg,extent,meta_bg,flex_bg,ext_attr".
With this patch applied, I have run the test 100 times and the e2fsck
error does not show up again.
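
The race being closed is a classic unlocked read-modify-write. As a
sketch using the helpers touched by this patch, the fix serializes the
counter update:

	/*
	 * Without ext4_lock_group(), two CPUs can both read the same
	 * free inodes count and both write back count - 1, losing one
	 * of the two decrements.
	 */
	ext4_lock_group(sb, group);
	ext4_free_inodes_set(sb, gdp, ext4_free_inodes_count(sb, gdp) - 1);
	ext4_unlock_group(sb, group);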

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


---
 fs/ext4/ialloc.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -778,6 +778,8 @@ got:
 			ext4_itable_unused_set(sb, gdp,
 					(EXT4_INODES_PER_GROUP(sb) - ino));
 		up_read(&grp->alloc_sem);
+	} else {
+		ext4_lock_group(sb, group);
 	}
 	ext4_free_inodes_set(sb, gdp, ext4_free_inodes_count(sb, gdp) - 1);
 	if (S_ISDIR(mode)) {
@@ -790,8 +792,8 @@ got:
 	}
 	if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_GDT_CSUM)) {
 		gdp->bg_checksum = ext4_group_desc_csum(sbi, group, gdp);
-		ext4_unlock_group(sb, group);
 	}
+	ext4_unlock_group(sb, group);
 
 	BUFFER_TRACE(group_desc_bh, "call ext4_handle_dirty_metadata");
 	err = ext4_handle_dirty_metadata(handle, NULL, group_desc_bh);




* [PATCH 3.4 07/24] drm/i915: kick any firmware framebuffers before claiming the gtt
From: Greg Kroah-Hartman @ 2014-02-18 22:46 UTC
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Chris Wilson, Daniel Vetter,
	Dave Airlie, Li Zefan

3.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Daniel Vetter <daniel.vetter@ffwll.ch>

commit 9f846a16d213523fbe6daea17e20df6b8ac5a1e5 upstream.

Especially vesafb likes to map everything as uc- (yikes), and if that
mapping still hangs around while we try to map the gtt as wc, the
kernel will downgrade our request to uc-, resulting in abysmal
performance.

Unfortunately we can't do this as early as radeon does (i.e. as the
first thing we do when initializing the hw) because our fb/mmio space
region moves around on a per-gen basis. So I've had to move it below
the gtt initialization, but that seems to work, too. The important
thing is that we do this before we set up the gtt wc mapping.

Now an altogether different question is why people compile their
kernels with vesafb enabled, but I guess making things just work isn't
bad per se ...

v2:
- s/radeondrmfb/inteldrmfb/
- fix up error handling

v3: Kill #ifdef X86, this is Intel after all. Noticed by Ben Widawsky.

v4: Jani Nikula complained about the pointless bool primary
initialization.

v5: Don't oops if we can't allocate, noticed by Chris Wilson.

v6: Resolve conflicts with agp rework and fixup whitespace.

This is commit e188719a2891f01b3100d in drm-next.

Backport to the 3.5 -fixes queue requested by Dave Airlie: because
grub uses vesa on Fedora, their initrd seems to load vesafb before
loading the real kms driver, so tons more people actually experience
a dead-slow gpu.  Hence also the Cc: stable.

Reported-and-tested-by: "Kilarski, Bernard R" <bernard.r.kilarski@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Dave Airlie <airlied@redhat.com>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


---
 drivers/gpu/drm/i915/i915_dma.c |   37 ++++++++++++++++++++++++++++++-------
 1 file changed, 30 insertions(+), 7 deletions(-)

--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -1934,6 +1934,27 @@ ips_ping_for_i915_load(void)
 	}
 }
 
+static void i915_kick_out_firmware_fb(struct drm_i915_private *dev_priv)
+{
+	struct apertures_struct *ap;
+	struct pci_dev *pdev = dev_priv->dev->pdev;
+	bool primary;
+
+	ap = alloc_apertures(1);
+	if (!ap)
+		return;
+
+	ap->ranges[0].base = dev_priv->dev->agp->base;
+	ap->ranges[0].size =
+		dev_priv->mm.gtt->gtt_mappable_entries << PAGE_SHIFT;
+	primary =
+		pdev->resource[PCI_ROM_RESOURCE].flags & IORESOURCE_ROM_SHADOW;
+
+	remove_conflicting_framebuffers(ap, "inteldrmfb", primary);
+
+	kfree(ap);
+}
+
 /**
  * i915_driver_load - setup chip and create an initial config
  * @dev: DRM device
@@ -1971,6 +1992,15 @@ int i915_driver_load(struct drm_device *
 		goto free_priv;
 	}
 
+	dev_priv->mm.gtt = intel_gtt_get();
+	if (!dev_priv->mm.gtt) {
+		DRM_ERROR("Failed to initialize GTT\n");
+		ret = -ENODEV;
+		goto put_bridge;
+	}
+
+	i915_kick_out_firmware_fb(dev_priv);
+
 	pci_set_master(dev->pdev);
 
 	/* overlay on gen2 is broken and can't address above 1G */
@@ -1996,13 +2026,6 @@ int i915_driver_load(struct drm_device *
 		goto put_bridge;
 	}
 
-	dev_priv->mm.gtt = intel_gtt_get();
-	if (!dev_priv->mm.gtt) {
-		DRM_ERROR("Failed to initialize GTT\n");
-		ret = -ENODEV;
-		goto out_rmmap;
-	}
-
 	agp_size = dev_priv->mm.gtt->gtt_mappable_entries << PAGE_SHIFT;
 
 	dev_priv->mm.gtt_mapping =




* [PATCH 3.4 08/24] mm/page_alloc.c: remove pageblock_default_order()
From: Greg Kroah-Hartman @ 2014-02-18 22:46 UTC
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, rajman mekaco, Mel Gorman,
	KAMEZAWA Hiroyuki, Tejun Heo, Minchan Kim, Andrew Morton,
	Linus Torvalds, Li Zefan

3.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Andrew Morton <akpm@linux-foundation.org>

commit 955c1cd7401565671b064e499115344ec8067dfd upstream.

This has always been broken: one version takes an unsigned int and the
other version takes no arguments.  This bug was hidden because one
version of set_pageblock_order() was a macro which doesn't evaluate its
argument.

Simplify it all and remove pageblock_default_order() altogether.
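
A minimal standalone illustration of how the macro hid the mismatch (a
sketch, not the kernel code): an unused macro parameter is never
expanded, so the broken call below is never type-checked:

	#define set_pageblock_order(x)	do {} while (0)

	/* Declared to take an argument ... */
	static int pageblock_default_order(unsigned int order)
	{
		return (int)order;
	}

	void init_zone(void)
	{
		/* ... but invoked here with none. This still compiles,
		 * because the macro discards 'x' without evaluating it. */
		set_pageblock_order(pageblock_default_order());
	}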

Reported-by: rajman mekaco <rajman.mekaco@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 mm/page_alloc.c |   33 +++++++++++++++------------------
 1 file changed, 15 insertions(+), 18 deletions(-)

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4254,25 +4254,24 @@ static inline void setup_usemap(struct p
 
 #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
 
-/* Return a sensible default order for the pageblock size. */
-static inline int pageblock_default_order(void)
-{
-	if (HPAGE_SHIFT > PAGE_SHIFT)
-		return HUGETLB_PAGE_ORDER;
-
-	return MAX_ORDER-1;
-}
-
 /* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
-static inline void __init set_pageblock_order(unsigned int order)
+static inline void __init set_pageblock_order(void)
 {
+	unsigned int order;
+
 	/* Check that pageblock_nr_pages has not already been setup */
 	if (pageblock_order)
 		return;
 
+	if (HPAGE_SHIFT > PAGE_SHIFT)
+		order = HUGETLB_PAGE_ORDER;
+	else
+		order = MAX_ORDER - 1;
+
 	/*
 	 * Assume the largest contiguous order of interest is a huge page.
-	 * This value may be variable depending on boot parameters on IA64
+	 * This value may be variable depending on boot parameters on IA64 and
+	 * powerpc.
 	 */
 	pageblock_order = order;
 }
@@ -4280,15 +4279,13 @@ static inline void __init set_pageblock_
 
 /*
  * When CONFIG_HUGETLB_PAGE_SIZE_VARIABLE is not set, set_pageblock_order()
- * and pageblock_default_order() are unused as pageblock_order is set
- * at compile-time. See include/linux/pageblock-flags.h for the values of
- * pageblock_order based on the kernel config
+ * is unused as pageblock_order is set at compile-time. See
+ * include/linux/pageblock-flags.h for the values of pageblock_order based on
+ * the kernel config
  */
-static inline int pageblock_default_order(unsigned int order)
+static inline void set_pageblock_order(void)
 {
-	return MAX_ORDER-1;
 }
-#define set_pageblock_order(x)	do {} while (0)
 
 #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
 
@@ -4376,7 +4373,7 @@ static void __paginginit free_area_init_
 		if (!size)
 			continue;
 
-		set_pageblock_order(pageblock_default_order());
+		set_pageblock_order();
 		setup_usemap(pgdat, zone, zone_start_pfn, size);
 		ret = init_currently_empty_zone(zone, zone_start_pfn,
 						size, MEMMAP_EARLY);




* [PATCH 3.4 09/24] mm: setup pageblock_order before its used by sparsemem
From: Greg Kroah-Hartman @ 2014-02-18 22:46 UTC
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Xishi Qiu, Jiang Liu, Tony Luck,
	Yinghai Lu, KAMEZAWA Hiroyuki, Benjamin Herrenschmidt,
	KOSAKI Motohiro, David Rientjes, Keping Chen, Andrew Morton,
	Linus Torvalds, Li Zefan

3.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Xishi Qiu <qiuxishi@huawei.com>

commit ca57df79d4f64e1a4886606af4289d40636189c5 upstream.

On architectures with CONFIG_HUGETLB_PAGE_SIZE_VARIABLE set, such as
Itanium, pageblock_order is a variable with a default value of 0.  It
is set to the right value by set_pageblock_order() in
free_area_init_core().

But pageblock_order may be used by sparse_init() before
free_area_init_core() is called, along this path:

sparse_init()
    ->sparse_early_usemaps_alloc_node()
	->usemap_size()
	    ->SECTION_BLOCKFLAGS_BITS
		->((1UL << (PFN_SECTION_SHIFT - pageblock_order)) * NR_PAGEBLOCK_BITS)

The uninitialized pageblock_order causes memory to be wasted because
usemap_size() returns a much bigger value than is really needed.

For example, on an Itanium platform,
sparse_init() pageblock_order=0 usemap_size=24576
free_area_init_core() before pageblock_order=0, usemap_size=24576
free_area_init_core() after pageblock_order=12, usemap_size=8

That means 24K of memory is wasted for each section, so fix it by
calling set_pageblock_order() from sparse_init().

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Jiang Liu <liuj97@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Keping Chen <chenkeping@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 mm/internal.h   |    2 ++
 mm/page_alloc.c |    4 ++--
 mm/sparse.c     |    3 +++
 3 files changed, 7 insertions(+), 2 deletions(-)

--- a/mm/internal.h
+++ b/mm/internal.h
@@ -309,3 +309,5 @@ extern u64 hwpoison_filter_flags_mask;
 extern u64 hwpoison_filter_flags_value;
 extern u64 hwpoison_filter_memcg;
 extern u32 hwpoison_filter_enable;
+
+extern void set_pageblock_order(void);
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4255,7 +4255,7 @@ static inline void setup_usemap(struct p
 #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
 
 /* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
-static inline void __init set_pageblock_order(void)
+void __init set_pageblock_order(void)
 {
 	unsigned int order;
 
@@ -4283,7 +4283,7 @@ static inline void __init set_pageblock_
  * include/linux/pageblock-flags.h for the values of pageblock_order based on
  * the kernel config
  */
-static inline void set_pageblock_order(void)
+void __init set_pageblock_order(void)
 {
 }
 
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -486,6 +486,9 @@ void __init sparse_init(void)
 	struct page **map_map;
 #endif
 
+	/* Setup pageblock_order for HUGETLB_PAGE_SIZE_VARIABLE */
+	set_pageblock_order();
+
 	/*
 	 * map is using big page (aka 2M in x86 64 bit)
 	 * usemap is less one page (aka 24 bytes)




* [PATCH 3.4 10/24] dm sysfs: fix a module unload race
From: Greg Kroah-Hartman @ 2014-02-18 22:46 UTC
  To: linux-kernel; +Cc: Greg Kroah-Hartman, stable, Mikulas Patocka, Mike Snitzer

3.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Mikulas Patocka <mpatocka@redhat.com>

commit 2995fa78e423d7193f3b57835f6c1c75006a0315 upstream.

This reverts commit be35f48610 ("dm: wait until embedded kobject is
released before destroying a device") and provides an improved fix.

The kobject release code that calls the completion must be placed in a
non-module file, otherwise there is a module unload race (if the process
calling dm_kobject_release is preempted and the DM module unloaded after
the completion is triggered, but before dm_kobject_release returns).

To fix this race, this patch moves the completion code to dm-builtin.c
which is always compiled directly into the kernel if BLK_DEV_DM is
selected.

The patch introduces a new dm_kobject_holder structure, its purpose is
to keep the completion and kobject in one place, so that it can be
accessed from non-module code without the need to export the layout of
struct mapped_device to that code.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 drivers/md/Kconfig      |    4 +++
 drivers/md/Makefile     |    1 
 drivers/md/dm-builtin.c |   50 ++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/dm-sysfs.c   |    5 ----
 drivers/md/dm.c         |   26 ++++--------------------
 drivers/md/dm.h         |   17 +++++++++++++++-
 6 files changed, 76 insertions(+), 27 deletions(-)

--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -185,8 +185,12 @@ config MD_FAULTY
 
 	  In unsure, say N.
 
+config BLK_DEV_DM_BUILTIN
+	boolean
+
 config BLK_DEV_DM
 	tristate "Device mapper support"
+	select BLK_DEV_DM_BUILTIN
 	---help---
 	  Device-mapper is a low level volume manager.  It works by allowing
 	  people to specify mappings for ranges of logical sectors.  Various
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -28,6 +28,7 @@ obj-$(CONFIG_MD_MULTIPATH)	+= multipath.
 obj-$(CONFIG_MD_FAULTY)		+= faulty.o
 obj-$(CONFIG_BLK_DEV_MD)	+= md-mod.o
 obj-$(CONFIG_BLK_DEV_DM)	+= dm-mod.o
+obj-$(CONFIG_BLK_DEV_DM_BUILTIN) += dm-builtin.o
 obj-$(CONFIG_DM_BUFIO)		+= dm-bufio.o
 obj-$(CONFIG_DM_CRYPT)		+= dm-crypt.o
 obj-$(CONFIG_DM_DELAY)		+= dm-delay.o
--- /dev/null
+++ b/drivers/md/dm-builtin.c
@@ -0,0 +1,50 @@
+#include "dm.h"
+
+#include <linux/export.h>
+
+/*
+ * The kobject release method must not be placed in the module itself,
+ * otherwise we are subject to module unload races.
+ *
+ * The release method is called when the last reference to the kobject is
+ * dropped. It may be called by any other kernel code that drops the last
+ * reference.
+ *
+ * The release method suffers from module unload race. We may prevent the
+ * module from being unloaded at the start of the release method (using
+ * increased module reference count or synchronizing against the release
+ * method), however there is no way to prevent the module from being
+ * unloaded at the end of the release method.
+ *
+ * If this code were placed in the dm module, the following race may
+ * happen:
+ *  1. Some other process takes a reference to dm kobject
+ *  2. The user issues ioctl function to unload the dm device
+ *  3. dm_sysfs_exit calls kobject_put, however the object is not released
+ *     because of the other reference taken at step 1
+ *  4. dm_sysfs_exit waits on the completion
+ *  5. The other process that took the reference in step 1 drops it,
+ *     dm_kobject_release is called from this process
+ *  6. dm_kobject_release calls complete()
+ *  7. a reschedule happens before dm_kobject_release returns
+ *  8. dm_sysfs_exit continues, the dm device is unloaded, module reference
+ *     count is decremented
+ *  9. The user unloads the dm module
+ * 10. The other process that was rescheduled in step 7 continues to run,
+ *     it is now executing code in unloaded module, so it crashes
+ *
+ * Note that if the process that takes the foreign reference to dm kobject
+ * has a low priority and the system is sufficiently loaded with
+ * higher-priority processes that prevent the low-priority process from
+ * being scheduled long enough, this bug may really happen.
+ *
+ * In order to fix this module unload race, we place the release method
+ * into a helper code that is compiled directly into the kernel.
+ */
+
+void dm_kobject_release(struct kobject *kobj)
+{
+	complete(dm_get_completion_from_kobject(kobj));
+}
+
+EXPORT_SYMBOL(dm_kobject_release);
--- a/drivers/md/dm-sysfs.c
+++ b/drivers/md/dm-sysfs.c
@@ -79,11 +79,6 @@ static const struct sysfs_ops dm_sysfs_o
 	.show	= dm_attr_show,
 };
 
-static void dm_kobject_release(struct kobject *kobj)
-{
-	complete(dm_get_completion_from_kobject(kobj));
-}
-
 /*
  * dm kobject is embedded in mapped_device structure
  * no need to define release function here
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -191,11 +191,8 @@ struct mapped_device {
 	/* forced geometry settings */
 	struct hd_geometry geometry;
 
-	/* sysfs handle */
-	struct kobject kobj;
-
-	/* wait until the kobject is released */
-	struct completion kobj_completion;
+	/* kobject and completion */
+	struct dm_kobject_holder kobj_holder;
 
 	/* zero-length flush that will be cloned and submitted to targets */
 	struct bio flush_bio;
@@ -1894,7 +1891,7 @@ static struct mapped_device *alloc_dev(i
 	init_waitqueue_head(&md->wait);
 	INIT_WORK(&md->work, dm_wq_work);
 	init_waitqueue_head(&md->eventq);
-	init_completion(&md->kobj_completion);
+	init_completion(&md->kobj_holder.completion);
 
 	md->disk->major = _major;
 	md->disk->first_minor = minor;
@@ -2686,20 +2683,14 @@ struct gendisk *dm_disk(struct mapped_de
 
 struct kobject *dm_kobject(struct mapped_device *md)
 {
-	return &md->kobj;
+	return &md->kobj_holder.kobj;
 }
 
-/*
- * struct mapped_device should not be exported outside of dm.c
- * so use this check to verify that kobj is part of md structure
- */
 struct mapped_device *dm_get_from_kobject(struct kobject *kobj)
 {
 	struct mapped_device *md;
 
-	md = container_of(kobj, struct mapped_device, kobj);
-	if (&md->kobj != kobj)
-		return NULL;
+	md = container_of(kobj, struct mapped_device, kobj_holder.kobj);
 
 	if (test_bit(DMF_FREEING, &md->flags) ||
 	    dm_deleting_md(md))
@@ -2709,13 +2700,6 @@ struct mapped_device *dm_get_from_kobjec
 	return md;
 }
 
-struct completion *dm_get_completion_from_kobject(struct kobject *kobj)
-{
-	struct mapped_device *md = container_of(kobj, struct mapped_device, kobj);
-
-	return &md->kobj_completion;
-}
-
 int dm_suspended_md(struct mapped_device *md)
 {
 	return test_bit(DMF_SUSPENDED, &md->flags);
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -16,6 +16,7 @@
 #include <linux/blkdev.h>
 #include <linux/hdreg.h>
 #include <linux/completion.h>
+#include <linux/kobject.h>
 
 /*
  * Suspend feature flags
@@ -120,11 +121,25 @@ void dm_interface_exit(void);
 /*
  * sysfs interface
  */
+struct dm_kobject_holder {
+	struct kobject kobj;
+	struct completion completion;
+};
+
+static inline struct completion *dm_get_completion_from_kobject(struct kobject *kobj)
+{
+	return &container_of(kobj, struct dm_kobject_holder, kobj)->completion;
+}
+
 int dm_sysfs_init(struct mapped_device *md);
 void dm_sysfs_exit(struct mapped_device *md);
 struct kobject *dm_kobject(struct mapped_device *md);
 struct mapped_device *dm_get_from_kobject(struct kobject *kobj);
-struct completion *dm_get_completion_from_kobject(struct kobject *kobj);
+
+/*
+ * The kobject helper
+ */
+void dm_kobject_release(struct kobject *kobj);
 
 /*
  * Targets for linear and striped mappings




* [PATCH 3.4 11/24] ftrace: Synchronize setting function_trace_op with ftrace_trace_function
From: Greg Kroah-Hartman @ 2014-02-18 22:46 UTC
  To: linux-kernel, Steven Rostedt; +Cc: Greg Kroah-Hartman, stable

3.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Steven Rostedt <rostedt@goodmis.org>

commit 405e1d834807e51b2ebd3dea81cb51e53fb61504 upstream.

[ Partial commit backported to 3.4. The ftrace_sync() code by this is
  required for other fixes that 3.4 needs. ]

ftrace_trace_function is a variable that holds what function will be called
directly by the assembly code (mcount). If just a single function is
registered and it handles recursion itself, then the assembly will call that
function directly without any helper function. It also passes in the
ftrace_op that was registered with the callback. The ftrace_op to send is
stored in the function_trace_op variable.

ftrace_trace_function and function_trace_op need to be coordinated so
that the callback is never called with the wrong ftrace_op; otherwise
bad things can happen if it expected a different op. Luckily, there is
currently no callback that bypasses the helper functions and therefore
requires this coordination. But there soon will be, and this needs to
be fixed.

Use a new variable, set_function_trace_op, to store the ftrace_op that
function_trace_op should be set to when it is safe to do so (during
the update function within the breakpoint or stop_machine calls). If
dynamic ftrace is not being used (static tracing), then we have to do
a bit more synchronization when ftrace_trace_function is set, as that
takes effect immediately (as opposed to dynamic ftrace doing it with
the modification of the trampoline).
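
A simplified model of the hazard (a sketch with hypothetical
declarations; the real types differ across kernel versions): the
trampoline reads two globals, so an update between the two loads can
pair a new callback with a stale op:

	struct ftrace_ops;
	typedef void (*ftrace_func_t)(unsigned long ip,
				      unsigned long parent_ip,
				      struct ftrace_ops *op);

	extern ftrace_func_t ftrace_trace_function;
	extern struct ftrace_ops *function_trace_op;

	/* Conceptually, what the mcount trampoline does: */
	void trampoline_model(unsigned long ip, unsigned long parent_ip)
	{
		ftrace_func_t func = ftrace_trace_function;	/* load #1 */
		struct ftrace_ops *op = function_trace_op;	/* load #2 */

		/* If both globals change between the two loads, func
		 * runs with the wrong op. The fix stages the new value
		 * in set_function_trace_op and publishes it only at a
		 * point where no CPU can sit between the loads. */
		func(ip, parent_ip, op);
	}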

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 kernel/trace/ftrace.c |   17 +++++++++++++++++
 1 file changed, 17 insertions(+)

--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -222,6 +222,23 @@ static void update_global_ops(void)
 	global_ops.func = func;
 }
 
+static void ftrace_sync(struct work_struct *work)
+{
+	/*
+	 * This function is just a stub to implement a hard force
+	 * of synchronize_sched(). This requires synchronizing
+	 * tasks even in userspace and idle.
+	 *
+	 * Yes, function tracing is rude.
+	 */
+}
+
+static void ftrace_sync_ipi(void *data)
+{
+	/* Probably not needed, but do it anyway */
+	smp_rmb();
+}
+
 static void update_ftrace_function(void)
 {
 	ftrace_func_t func;




* [PATCH 3.4 12/24] ftrace: Fix synchronization location disabling and freeing ftrace_ops
From: Greg Kroah-Hartman @ 2014-02-18 22:46 UTC
  To: linux-kernel, Steven Rostedt; +Cc: Greg Kroah-Hartman, stable

3.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Steven Rostedt <rostedt@goodmis.org>

commit a4c35ed241129dd142be4cadb1e5a474a56d5464 upstream.

The synchronization needed after ftrace_ops are unregistered must
happen after the callback is disabled from being called by functions.

The current synchronization happens after the function is removed
from the internal lists, but not after the function callbacks are
disabled, leaving the functions susceptible to being called after
their callbacks are freed.

This affects perf and any external users of function tracing (LTTng
and SystemTap).
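
In outline, the ordering this patch establishes in ftrace_shutdown()
is (a sketch of the sequence, not literal code):

	/*
	 * 1. Remove the ops from the internal list: no new users.
	 * 2. Update ftrace_trace_function / run the update code so the
	 *    callback can no longer be entered.
	 * 3. Only then wait out every CPU and free:
	 */
	if (ops->flags & (FTRACE_OPS_FL_DYNAMIC | FTRACE_OPS_FL_CONTROL)) {
		schedule_on_each_cpu(ftrace_sync);	/* hard sched sync */

		if (ops->flags & FTRACE_OPS_FL_CONTROL)
			control_ops_free(ops);		/* now safe */
	}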

Fixes: cdbe61bfe704 "ftrace: Allow dynamically allocated function tracers"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 kernel/trace/ftrace.c |   50 ++++++++++++++++++++++++++++++++------------------
 1 file changed, 32 insertions(+), 18 deletions(-)

--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -376,16 +376,6 @@ static int __unregister_ftrace_function(
 	} else if (ops->flags & FTRACE_OPS_FL_CONTROL) {
 		ret = remove_ftrace_list_ops(&ftrace_control_list,
 					     &control_ops, ops);
-		if (!ret) {
-			/*
-			 * The ftrace_ops is now removed from the list,
-			 * so there'll be no new users. We must ensure
-			 * all current users are done before we free
-			 * the control data.
-			 */
-			synchronize_sched();
-			control_ops_free(ops);
-		}
 	} else
 		ret = remove_ftrace_ops(&ftrace_ops_list, ops);
 
@@ -395,13 +385,6 @@ static int __unregister_ftrace_function(
 	if (ftrace_enabled)
 		update_ftrace_function();
 
-	/*
-	 * Dynamic ops may be freed, we must make sure that all
-	 * callers are done before leaving this function.
-	 */
-	if (ops->flags & FTRACE_OPS_FL_DYNAMIC)
-		synchronize_sched();
-
 	return 0;
 }
 
@@ -2025,10 +2008,41 @@ static int ftrace_shutdown(struct ftrace
 		command |= FTRACE_UPDATE_TRACE_FUNC;
 	}
 
-	if (!command || !ftrace_enabled)
+	if (!command || !ftrace_enabled) {
+		/*
+		 * If these are control ops, they still need their
+		 * per_cpu field freed. Since, function tracing is
+		 * not currently active, we can just free them
+		 * without synchronizing all CPUs.
+		 */
+		if (ops->flags & FTRACE_OPS_FL_CONTROL)
+			control_ops_free(ops);
 		return 0;
+	}
 
 	ftrace_run_update_code(command);
+
+	/*
+	 * Dynamic ops may be freed, we must make sure that all
+	 * callers are done before leaving this function.
+	 * The same goes for freeing the per_cpu data of the control
+	 * ops.
+	 *
+	 * Again, normal synchronize_sched() is not good enough.
+	 * We need to do a hard force of sched synchronization.
+	 * This is because we use preempt_disable() to do RCU, but
+	 * the function tracers can be called where RCU is not watching
+	 * (like before user_exit()). We can not rely on the RCU
+	 * infrastructure to do the synchronization, thus we must do it
+	 * ourselves.
+	 */
+	if (ops->flags & (FTRACE_OPS_FL_DYNAMIC | FTRACE_OPS_FL_CONTROL)) {
+		schedule_on_each_cpu(ftrace_sync);
+
+		if (ops->flags & FTRACE_OPS_FL_CONTROL)
+			control_ops_free(ops);
+	}
+
 	return 0;
 }
 




* [PATCH 3.4 13/24] ftrace: Have function graph only trace based on global_ops filters
From: Greg Kroah-Hartman @ 2014-02-18 22:47 UTC
  To: linux-kernel, Steven Rostedt; +Cc: Greg Kroah-Hartman, stable

3.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Steven Rostedt <rostedt@goodmis.org>

commit 23a8e8441a0a74dd612edf81dc89d1600bc0a3d1 upstream.

While running some different tests, I discovered that function graph
tracing, when filtered via the set_ftrace_filter and
set_ftrace_notrace files, does not always honor those filters if
another function ftrace_ops is registered to trace functions.

The reason is that function graph just happens to trace all functions
that the function tracer enables. When there was only one user of
function tracing, the function graph tracer did not need to worry about
being called by functions that it did not want to trace. But now that there
are other users, this becomes a problem.

For example, one just needs to do the following:

 # cd /sys/kernel/debug/tracing
 # echo schedule > set_ftrace_filter
 # echo function_graph > current_tracer
 # cat trace
[..]
 0)               |  schedule() {
 ------------------------------------------
 0)    <idle>-0    =>   rcu_pre-7
 ------------------------------------------

 0) ! 2980.314 us |  }
 0)               |  schedule() {
 ------------------------------------------
 0)   rcu_pre-7    =>    <idle>-0
 ------------------------------------------

 0) + 20.701 us   |  }

 # echo 1 > /proc/sys/kernel/stack_tracer_enabled
 # cat trace
[..]
 1) + 20.825 us   |      }
 1) + 21.651 us   |    }
 1) + 30.924 us   |  } /* SyS_ioctl */
 1)               |  do_page_fault() {
 1)               |    __do_page_fault() {
 1)   0.274 us    |      down_read_trylock();
 1)   0.098 us    |      find_vma();
 1)               |      handle_mm_fault() {
 1)               |        _raw_spin_lock() {
 1)   0.102 us    |          preempt_count_add();
 1)   0.097 us    |          do_raw_spin_lock();
 1)   2.173 us    |        }
 1)               |        do_wp_page() {
 1)   0.079 us    |          vm_normal_page();
 1)   0.086 us    |          reuse_swap_page();
 1)   0.076 us    |          page_move_anon_rmap();
 1)               |          unlock_page() {
 1)   0.082 us    |            page_waitqueue();
 1)   0.086 us    |            __wake_up_bit();
 1)   1.801 us    |          }
 1)   0.075 us    |          ptep_set_access_flags();
 1)               |          _raw_spin_unlock() {
 1)   0.098 us    |            do_raw_spin_unlock();
 1)   0.105 us    |            preempt_count_sub();
 1)   1.884 us    |          }
 1)   9.149 us    |        }
 1) + 13.083 us   |      }
 1)   0.146 us    |      up_read();

When the stack tracer was enabled, it enabled tracing of all
functions, which the function graph tracer then also traced. This is
a side effect that should not occur.

To fix this, a test is added when the function tracing is changed, as
well as when the graph tracer is enabled, to see whether anything
other than the ftrace global_ops function tracer is enabled. If so,
the graph tracer calls a test trampoline that looks at the function
being traced and compares it with the filters defined by global_ops.

As an optimization, if there's no other function tracers registered, or if
the only registered function tracers also use the global ops, the function
graph infrastructure will call the registered function graph callback directly
and not go through the test trampoline.

Fixes: d2d45c7a03a2 "tracing: Have stack_tracer use a separate list of functions"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 kernel/trace/ftrace.c |   45 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 44 insertions(+), 1 deletion(-)

--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -239,6 +239,12 @@ static void ftrace_sync_ipi(void *data)
 	smp_rmb();
 }
 
+#ifdef CONFIG_FUNCTION_GRAPH_TRACER
+static void update_function_graph_func(void);
+#else
+static inline void update_function_graph_func(void) { }
+#endif
+
 static void update_ftrace_function(void)
 {
 	ftrace_func_t func;
@@ -257,6 +263,8 @@ static void update_ftrace_function(void)
 	else
 		func = ftrace_ops_list_func;
 
+	update_function_graph_func();
+
 #ifdef CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST
 	ftrace_trace_function = func;
 #else
@@ -4435,6 +4443,7 @@ int ftrace_graph_entry_stub(struct ftrac
 trace_func_graph_ret_t ftrace_graph_return =
 			(trace_func_graph_ret_t)ftrace_stub;
 trace_func_graph_ent_t ftrace_graph_entry = ftrace_graph_entry_stub;
+static trace_func_graph_ent_t __ftrace_graph_entry = ftrace_graph_entry_stub;
 
 /* Try to assign a return stack array on FTRACE_RETSTACK_ALLOC_SIZE tasks. */
 static int alloc_retstack_tasklist(struct ftrace_ret_stack **ret_stack_list)
@@ -4575,6 +4584,30 @@ static struct ftrace_ops fgraph_ops __re
 	.flags		= FTRACE_OPS_FL_GLOBAL,
 };
 
+static int ftrace_graph_entry_test(struct ftrace_graph_ent *trace)
+{
+	if (!ftrace_ops_test(&global_ops, trace->func))
+		return 0;
+	return __ftrace_graph_entry(trace);
+}
+
+/*
+ * The function graph tracer should only trace the functions defined
+ * by set_ftrace_filter and set_ftrace_notrace. If another function
+ * tracer ops is registered, the graph tracer requires testing the
+ * function against the global ops, and not just trace any function
+ * that any ftrace_ops registered.
+ */
+static void update_function_graph_func(void)
+{
+	if (ftrace_ops_list == &ftrace_list_end ||
+	    (ftrace_ops_list == &global_ops &&
+	     global_ops.next == &ftrace_list_end))
+		ftrace_graph_entry = __ftrace_graph_entry;
+	else
+		ftrace_graph_entry = ftrace_graph_entry_test;
+}
+
 int register_ftrace_graph(trace_func_graph_ret_t retfunc,
 			trace_func_graph_ent_t entryfunc)
 {
@@ -4599,7 +4632,16 @@ int register_ftrace_graph(trace_func_gra
 	}
 
 	ftrace_graph_return = retfunc;
-	ftrace_graph_entry = entryfunc;
+
+	/*
+	 * Update the indirect function to the entryfunc, and the
+	 * function that gets called to the entry_test first. Then
+	 * call the update fgraph entry function to determine if
+	 * the entryfunc should be called directly or not.
+	 */
+	__ftrace_graph_entry = entryfunc;
+	ftrace_graph_entry = ftrace_graph_entry_test;
+	update_function_graph_func();
 
 	ret = ftrace_startup(&fgraph_ops, FTRACE_START_FUNC_RET);
 
@@ -4618,6 +4660,7 @@ void unregister_ftrace_graph(void)
 	ftrace_graph_active--;
 	ftrace_graph_return = (trace_func_graph_ret_t)ftrace_stub;
 	ftrace_graph_entry = ftrace_graph_entry_stub;
+	__ftrace_graph_entry = ftrace_graph_entry_stub;
 	ftrace_shutdown(&fgraph_ops, FTRACE_STOP_FUNC_RET);
 	unregister_pm_notifier(&ftrace_suspend_notifier);
 	unregister_trace_sched_switch(ftrace_graph_probe_sched_switch, NULL);



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 3.4 14/24] sched/nohz: Fix rq->cpu_load[] calculations
  2014-02-18 22:46 [PATCH 3.4 00/24] 3.4.81-stable review Greg Kroah-Hartman
                   ` (12 preceding siblings ...)
  2014-02-18 22:47 ` [PATCH 3.4 13/24] ftrace: Have function graph only trace based on global_ops filters Greg Kroah-Hartman
@ 2014-02-18 22:47 ` Greg Kroah-Hartman
  2014-02-18 22:47 ` [PATCH 3.4 15/24] sched/nohz: Fix rq->cpu_load calculations some more Greg Kroah-Hartman
                   ` (11 subsequent siblings)
  25 siblings, 0 replies; 28+ messages in thread
From: Greg Kroah-Hartman @ 2014-02-18 22:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Peter Zijlstra, pjt,
	Venkatesh Pallipadi, Ingo Molnar, Li Zefan

3.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

commit 556061b00c9f2fd6a5524b6bde823ef12f299ecf upstream.

While investigating why the load-balancer was acting funny, I found that
the rq->cpu_load[] tables were completely screwy; a bit more digging
revealed that the updates that got through were missing ticks, followed
by a catch-up of 2 ticks.

The catch-up assumes the cpu was idle during that time (since only nohz
can cause missed ticks and the machine is idle, etc.); this means that
especially the higher indices were significantly lower than they ought
to be.

The reason for this is that it's not correct to compare against jiffies
on every jiffy on any cpu other than the one that updates jiffies.

This patch kludges around it by only doing the catch-up from
nohz_idle_balance() and doing the regular update unconditionally from
the tick.
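
For intuition, catching up a missed-ticks window for cpu_load[i] amounts
to decaying the old value once per missed tick; a rough sketch of the
arithmetic (the helper below is hypothetical, not the kernel code):

	/* decay 'load' for cpu_load[i] over 'missed' idle ticks:
	 * each tick multiplies by (2^i - 1) / 2^i, i.e. new = old - old/2^i
	 */
	static unsigned long decay_idle(unsigned long load,
					unsigned long missed, int i)
	{
		while (missed--)
			load -= load >> i;
		return load;
	}

Note how i = 0 decays to zero after a single missed tick, matching
cpu_load[0] tracking the instantaneous load, while higher indices decay
slowly.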

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: pjt@google.com
Cc: Venkatesh Pallipadi <venki@google.com>
Link: http://lkml.kernel.org/n/tip-tp4kj18xdd5aj4vvj0qg55s2@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 kernel/sched/core.c  |   53 +++++++++++++++++++++++++++++++++++++--------------
 kernel/sched/fair.c  |    2 -
 kernel/sched/sched.h |    2 -
 3 files changed, 41 insertions(+), 16 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -692,8 +692,6 @@ int tg_nop(struct task_group *tg, void *
 }
 #endif
 
-void update_cpu_load(struct rq *this_rq);
-
 static void set_load_weight(struct task_struct *p)
 {
 	int prio = p->static_prio - MAX_RT_PRIO;
@@ -2620,22 +2618,13 @@ decay_load_missed(unsigned long load, un
  * scheduler tick (TICK_NSEC). With tickless idle this will not be called
  * every tick. We fix it up based on jiffies.
  */
-void update_cpu_load(struct rq *this_rq)
+static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
+			      unsigned long pending_updates)
 {
-	unsigned long this_load = this_rq->load.weight;
-	unsigned long curr_jiffies = jiffies;
-	unsigned long pending_updates;
 	int i, scale;
 
 	this_rq->nr_load_updates++;
 
-	/* Avoid repeated calls on same jiffy, when moving in and out of idle */
-	if (curr_jiffies == this_rq->last_load_update_tick)
-		return;
-
-	pending_updates = curr_jiffies - this_rq->last_load_update_tick;
-	this_rq->last_load_update_tick = curr_jiffies;
-
 	/* Update our load: */
 	this_rq->cpu_load[0] = this_load; /* Fasttrack for idx 0 */
 	for (i = 1, scale = 2; i < CPU_LOAD_IDX_MAX; i++, scale += scale) {
@@ -2660,9 +2649,45 @@ void update_cpu_load(struct rq *this_rq)
 	sched_avg_update(this_rq);
 }
 
+/*
+ * Called from nohz_idle_balance() to update the load ratings before doing the
+ * idle balance.
+ */
+void update_idle_cpu_load(struct rq *this_rq)
+{
+	unsigned long curr_jiffies = jiffies;
+	unsigned long load = this_rq->load.weight;
+	unsigned long pending_updates;
+
+	/*
+	 * Bloody broken means of dealing with nohz, but better than nothing..
+	 * jiffies is updated by one cpu, another cpu can drift wrt the jiffy
+	 * update and see 0 difference the one time and 2 the next, even though
+	 * we ticked at roughtly the same rate.
+	 *
+	 * Hence we only use this from nohz_idle_balance() and skip this
+	 * nonsense when called from the scheduler_tick() since that's
+	 * guaranteed a stable rate.
+	 */
+	if (load || curr_jiffies == this_rq->last_load_update_tick)
+		return;
+
+	pending_updates = curr_jiffies - this_rq->last_load_update_tick;
+	this_rq->last_load_update_tick = curr_jiffies;
+
+	__update_cpu_load(this_rq, load, pending_updates);
+}
+
+/*
+ * Called from scheduler_tick()
+ */
 static void update_cpu_load_active(struct rq *this_rq)
 {
-	update_cpu_load(this_rq);
+	/*
+	 * See the mess in update_idle_cpu_load().
+	 */
+	this_rq->last_load_update_tick = jiffies;
+	__update_cpu_load(this_rq, this_rq->load.weight, 1);
 
 	calc_load_account_active(this_rq);
 }
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5042,7 +5042,7 @@ static void nohz_idle_balance(int this_c
 
 		raw_spin_lock_irq(&this_rq->lock);
 		update_rq_clock(this_rq);
-		update_cpu_load(this_rq);
+		update_idle_cpu_load(this_rq);
 		raw_spin_unlock_irq(&this_rq->lock);
 
 		rebalance_domains(balance_cpu, CPU_IDLE);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -873,7 +873,7 @@ extern void resched_cpu(int cpu);
 extern struct rt_bandwidth def_rt_bandwidth;
 extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);
 
-extern void update_cpu_load(struct rq *this_rq);
+extern void update_idle_cpu_load(struct rq *this_rq);
 
 #ifdef CONFIG_CGROUP_CPUACCT
 #include <linux/cgroup.h>



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 3.4 15/24] sched/nohz: Fix rq->cpu_load calculations some more
  2014-02-18 22:46 [PATCH 3.4 00/24] 3.4.81-stable review Greg Kroah-Hartman
                   ` (13 preceding siblings ...)
  2014-02-18 22:47 ` [PATCH 3.4 14/24] sched/nohz: Fix rq->cpu_load[] calculations Greg Kroah-Hartman
@ 2014-02-18 22:47 ` Greg Kroah-Hartman
  2014-02-18 22:47 ` [PATCH 3.4 16/24] IB/qib: Convert qib_user_sdma_pin_pages() to use get_user_pages_fast() Greg Kroah-Hartman
                   ` (10 subsequent siblings)
  25 siblings, 0 replies; 28+ messages in thread
From: Greg Kroah-Hartman @ 2014-02-18 22:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Peter Zijlstra, pjt,
	Venkatesh Pallipadi, Ingo Molnar, Li Zefan

3.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

commit 5aaa0b7a2ed5b12692c9ffb5222182bd558d3146 upstream.

Follow-up to commit 556061b00 ("sched/nohz: Fix rq->cpu_load[]
calculations"): while that commit fixed the busy case, it regressed the
mostly idle case.

Add a callback from the nohz exit to also age the rq->cpu_load[]
array. This closes the hole where either there was no nohz load
balance pass during the nohz, or there was a 'significant' amount of
idle time between the last nohz balance and the nohz exit.

So we update unconditionally from the tick, so as not to insert any
accidental zero-load periods while busy, and we try to catch up from the
nohz idle balance and the nohz exit. Both of these are still prone to
missing a jiffy, but that has always been the case.
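
After this patch the rq->cpu_load[] updates therefore come from three
places; roughly (illustrative summary, not actual code):

	/*
	 * scheduler_tick()      -> __update_cpu_load(rq, rq->load.weight, 1)
	 * nohz_idle_balance()   -> update_idle_cpu_load(rq)    (idle catch-up)
	 * tick_nohz_idle_exit() -> update_cpu_load_nohz()      (exit catch-up)
	 */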

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: pjt@google.com
Cc: Venkatesh Pallipadi <venki@google.com>
Link: http://lkml.kernel.org/n/tip-kt0trz0apodbf84ucjfdbr1a@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 include/linux/sched.h    |    1 
 kernel/sched/core.c      |   53 ++++++++++++++++++++++++++++++++++++++---------
 kernel/time/tick-sched.c |    1 
 3 files changed, 45 insertions(+), 10 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -144,6 +144,7 @@ extern unsigned long this_cpu_load(void)
 
 
 extern void calc_global_load(unsigned long ticks);
+extern void update_cpu_load_nohz(void);
 
 extern unsigned long get_parent_ip(unsigned long addr);
 
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2649,25 +2649,32 @@ static void __update_cpu_load(struct rq
 	sched_avg_update(this_rq);
 }
 
+#ifdef CONFIG_NO_HZ
+/*
+ * There is no sane way to deal with nohz on smp when using jiffies because the
+ * cpu doing the jiffies update might drift wrt the cpu doing the jiffy reading
+ * causing off-by-one errors in observed deltas; {0,2} instead of {1,1}.
+ *
+ * Therefore we cannot use the delta approach from the regular tick since that
+ * would seriously skew the load calculation. However we'll make do for those
+ * updates happening while idle (nohz_idle_balance) or coming out of idle
+ * (tick_nohz_idle_exit).
+ *
+ * This means we might still be one tick off for nohz periods.
+ */
+
 /*
  * Called from nohz_idle_balance() to update the load ratings before doing the
  * idle balance.
  */
 void update_idle_cpu_load(struct rq *this_rq)
 {
-	unsigned long curr_jiffies = jiffies;
+	unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
 	unsigned long load = this_rq->load.weight;
 	unsigned long pending_updates;
 
 	/*
-	 * Bloody broken means of dealing with nohz, but better than nothing..
-	 * jiffies is updated by one cpu, another cpu can drift wrt the jiffy
-	 * update and see 0 difference the one time and 2 the next, even though
-	 * we ticked at roughtly the same rate.
-	 *
-	 * Hence we only use this from nohz_idle_balance() and skip this
-	 * nonsense when called from the scheduler_tick() since that's
-	 * guaranteed a stable rate.
+	 * bail if there's load or we're actually up-to-date.
 	 */
 	if (load || curr_jiffies == this_rq->last_load_update_tick)
 		return;
@@ -2679,12 +2686,38 @@ void update_idle_cpu_load(struct rq *thi
 }
 
 /*
+ * Called from tick_nohz_idle_exit() -- try and fix up the ticks we missed.
+ */
+void update_cpu_load_nohz(void)
+{
+	struct rq *this_rq = this_rq();
+	unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
+	unsigned long pending_updates;
+
+	if (curr_jiffies == this_rq->last_load_update_tick)
+		return;
+
+	raw_spin_lock(&this_rq->lock);
+	pending_updates = curr_jiffies - this_rq->last_load_update_tick;
+	if (pending_updates) {
+		this_rq->last_load_update_tick = curr_jiffies;
+		/*
+		 * We were idle, this means load 0, the current load might be
+		 * !0 due to remote wakeups and the sort.
+		 */
+		__update_cpu_load(this_rq, 0, pending_updates);
+	}
+	raw_spin_unlock(&this_rq->lock);
+}
+#endif /* CONFIG_NO_HZ */
+
+/*
  * Called from scheduler_tick()
  */
 static void update_cpu_load_active(struct rq *this_rq)
 {
 	/*
-	 * See the mess in update_idle_cpu_load().
+	 * See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
 	 */
 	this_rq->last_load_update_tick = jiffies;
 	__update_cpu_load(this_rq, this_rq->load.weight, 1);
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -582,6 +582,7 @@ void tick_nohz_idle_exit(void)
 	/* Update jiffies first */
 	select_nohz_load_balancer(0);
 	tick_do_update_jiffies64(now);
+	update_cpu_load_nohz();
 
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING
 	/*



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 3.4 16/24] IB/qib: Convert qib_user_sdma_pin_pages() to use get_user_pages_fast()
  2014-02-18 22:46 [PATCH 3.4 00/24] 3.4.81-stable review Greg Kroah-Hartman
                   ` (14 preceding siblings ...)
  2014-02-18 22:47 ` [PATCH 3.4 15/24] sched/nohz: Fix rq->cpu_load calculations some more Greg Kroah-Hartman
@ 2014-02-18 22:47 ` Greg Kroah-Hartman
  2014-02-18 22:47 ` [PATCH 3.4 17/24] target/file: Use O_DSYNC by default for FILEIO backends Greg Kroah-Hartman
                   ` (9 subsequent siblings)
  25 siblings, 0 replies; 28+ messages in thread
From: Greg Kroah-Hartman @ 2014-02-18 22:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Mike Marciniszyn, Jan Kara,
	Roland Dreier, Ben Hutchings

3.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Jan Kara <jack@suse.cz>

commit 603e7729920e42b3c2f4dbfab9eef4878cb6e8fa upstream.

qib_user_sdma_queue_pkts() gets called with mmap_sem held for
writing. Except for get_user_pages() deep down in
qib_user_sdma_pin_pages(), we don't seem to need mmap_sem at all.  Even
more interestingly, qib_user_sdma_queue_pkts() (and also
qib_user_sdma_coalesce(), called somewhat later) calls copy_from_user(),
which can hit a page fault, and we then deadlock trying to take mmap_sem
while handling that fault.

So just make qib_user_sdma_pin_pages() use get_user_pages_fast() and
leave the mmap_sem locking to the mm code.

This deadlock has actually been observed in the wild when the node
is under memory pressure.
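
The pattern being removed looks roughly like this (simplified sketch; the
deadlock is the page fault trying to re-take mmap_sem):

	down_write(&current->mm->mmap_sem);
	ret = qib_user_sdma_queue_pkts(dd, pq, &list, iov, dim, mxp);
	/* copy_from_user() inside may fault; the fault handler then
	 * tries to take mmap_sem, which we already hold: deadlock */
	up_write(&current->mm->mmap_sem);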

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Roland Dreier <roland@purestorage.com>
[Backported to 3.4: (Thanks to Ben Hutchings)
 - Adjust context
 - Adjust indentation and nr_pages argument in qib_user_sdma_pin_pages()]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 drivers/infiniband/hw/qib/qib_user_sdma.c |    6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

--- a/drivers/infiniband/hw/qib/qib_user_sdma.c
+++ b/drivers/infiniband/hw/qib/qib_user_sdma.c
@@ -284,8 +284,7 @@ static int qib_user_sdma_pin_pages(const
 	int j;
 	int ret;
 
-	ret = get_user_pages(current, current->mm, addr,
-			     npages, 0, 1, pages, NULL);
+	ret = get_user_pages_fast(addr, npages, 0, pages);
 
 	if (ret != npages) {
 		int i;
@@ -830,10 +829,7 @@ int qib_user_sdma_writev(struct qib_ctxt
 	while (dim) {
 		const int mxp = 8;
 
-		down_write(&current->mm->mmap_sem);
 		ret = qib_user_sdma_queue_pkts(dd, pq, &list, iov, dim, mxp);
-		up_write(&current->mm->mmap_sem);
-
 		if (ret <= 0)
 			goto done_unlock;
 		else {



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 3.4 17/24] target/file: Use O_DSYNC by default for FILEIO backends
  2014-02-18 22:46 [PATCH 3.4 00/24] 3.4.81-stable review Greg Kroah-Hartman
                   ` (15 preceding siblings ...)
  2014-02-18 22:47 ` [PATCH 3.4 16/24] IB/qib: Convert qib_user_sdma_pin_pages() to use get_user_pages_fast() Greg Kroah-Hartman
@ 2014-02-18 22:47 ` Greg Kroah-Hartman
  2014-02-18 22:47 ` [PATCH 3.4 18/24] target/file: Re-enable optional fd_buffered_io=1 operation Greg Kroah-Hartman
                   ` (8 subsequent siblings)
  25 siblings, 0 replies; 28+ messages in thread
From: Greg Kroah-Hartman @ 2014-02-18 22:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Christoph Hellwig, Linus Torvalds,
	Nicholas Bellinger, Ben Hutchings, Li Zefan

3.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Nicholas Bellinger <nab@linux-iscsi.org>

commit a4dff3043c231d57f982af635c9d2192ee40e5ae upstream.

Convert to using O_DSYNC in all cases at FILEIO backend creation time, to
avoid the extra syncing of pure timestamp updates that legacy O_SYNC incurs
during default operation, as recommended by hch.  Continue to do this
independently of the Write Cache Enable (WCE) bit, as WCE=0 is currently the
default for all backend devices and is enabled by the user on a per-device
basis via attrib/emulate_write_cache.

This patch drops the now unnecessary fd_buffered_io= token, which was
originally used to signal when to explicitly disable O_SYNC at backend
creation time for buffered I/O operation.  That mode can be dangerous for a
number of reasons during a physical node failure, so go ahead and drop the
option for now while O_DSYNC is the default.

Also allow the explicit FUA WRITE -> vfs_fsync_range() call in
fd_execute_cmd() to function independently of the WCE bit setting.
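
For reference, that FUA path boils down to syncing just the written LBA
range (sketch with names abridged from the diff below):

	/* SCSI WRITE with FUA set: flush exactly the bytes just written */
	loff_t start = task->task_lba * block_size;
	loff_t end = start + task->task_size;

	vfs_fsync_range(fd_dev->fd_file, start, end, 1); /* 1 == datasync */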

Reported-by: Christoph Hellwig <hch@lst.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
[bwh: Backported to 3.2:
 - We have fd_do_task() and not fd_execute_cmd()
 - Various fields are in struct se_task rather than struct se_cmd
 - fd_create_virtdevice() flags initialisation hasn't been cleaned up]
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 drivers/target/target_core_file.c |   78 ++++++++------------------------------
 drivers/target/target_core_file.h |    1 
 2 files changed, 17 insertions(+), 62 deletions(-)

--- a/drivers/target/target_core_file.c
+++ b/drivers/target/target_core_file.c
@@ -133,21 +133,11 @@ static struct se_device *fd_create_virtd
 		ret = PTR_ERR(dev_p);
 		goto fail;
 	}
-#if 0
-	if (di->no_create_file)
-		flags = O_RDWR | O_LARGEFILE;
-	else
-		flags = O_RDWR | O_CREAT | O_LARGEFILE;
-#else
-	flags = O_RDWR | O_CREAT | O_LARGEFILE;
-#endif
-/*	flags |= O_DIRECT; */
 	/*
-	 * If fd_buffered_io=1 has not been set explicitly (the default),
-	 * use O_SYNC to force FILEIO writes to disk.
+	 * Use O_DSYNC by default instead of O_SYNC to forgo syncing
+	 * of pure timestamp updates.
 	 */
-	if (!(fd_dev->fbd_flags & FDBD_USE_BUFFERED_IO))
-		flags |= O_SYNC;
+	flags = O_RDWR | O_CREAT | O_LARGEFILE | O_DSYNC;
 
 	file = filp_open(dev_p, flags, 0600);
 	if (IS_ERR(file)) {
@@ -399,26 +389,6 @@ static void fd_emulate_sync_cache(struct
 		transport_complete_sync_cache(cmd, ret == 0);
 }
 
-/*
- * WRITE Force Unit Access (FUA) emulation on a per struct se_task
- * LBA range basis..
- */
-static void fd_emulate_write_fua(struct se_cmd *cmd, struct se_task *task)
-{
-	struct se_device *dev = cmd->se_dev;
-	struct fd_dev *fd_dev = dev->dev_ptr;
-	loff_t start = task->task_lba * dev->se_sub_dev->se_dev_attrib.block_size;
-	loff_t end = start + task->task_size;
-	int ret;
-
-	pr_debug("FILEIO: FUA WRITE LBA: %llu, bytes: %u\n",
-			task->task_lba, task->task_size);
-
-	ret = vfs_fsync_range(fd_dev->fd_file, start, end, 1);
-	if (ret != 0)
-		pr_err("FILEIO: vfs_fsync_range() failed: %d\n", ret);
-}
-
 static int fd_do_task(struct se_task *task)
 {
 	struct se_cmd *cmd = task->task_se_cmd;
@@ -433,19 +403,21 @@ static int fd_do_task(struct se_task *ta
 		ret = fd_do_readv(task);
 	} else {
 		ret = fd_do_writev(task);
-
+		/*
+		 * Perform implict vfs_fsync_range() for fd_do_writev() ops
+		 * for SCSI WRITEs with Forced Unit Access (FUA) set.
+		 * Allow this to happen independent of WCE=0 setting.
+		 */
 		if (ret > 0 &&
-		    dev->se_sub_dev->se_dev_attrib.emulate_write_cache > 0 &&
 		    dev->se_sub_dev->se_dev_attrib.emulate_fua_write > 0 &&
 		    (cmd->se_cmd_flags & SCF_FUA)) {
-			/*
-			 * We might need to be a bit smarter here
-			 * and return some sense data to let the initiator
-			 * know the FUA WRITE cache sync failed..?
-			 */
-			fd_emulate_write_fua(cmd, task);
-		}
+			struct fd_dev *fd_dev = dev->dev_ptr;
+			loff_t start = task->task_lba *
+				dev->se_sub_dev->se_dev_attrib.block_size;
+			loff_t end = start + task->task_size;
 
+			vfs_fsync_range(fd_dev->fd_file, start, end, 1);
+		}
 	}
 
 	if (ret < 0) {
@@ -477,7 +449,6 @@ enum {
 static match_table_t tokens = {
 	{Opt_fd_dev_name, "fd_dev_name=%s"},
 	{Opt_fd_dev_size, "fd_dev_size=%s"},
-	{Opt_fd_buffered_io, "fd_buffered_io=%d"},
 	{Opt_err, NULL}
 };
 
@@ -489,7 +460,7 @@ static ssize_t fd_set_configfs_dev_param
 	struct fd_dev *fd_dev = se_dev->se_dev_su_ptr;
 	char *orig, *ptr, *arg_p, *opts;
 	substring_t args[MAX_OPT_ARGS];
-	int ret = 0, arg, token;
+	int ret = 0, token;
 
 	opts = kstrdup(page, GFP_KERNEL);
 	if (!opts)
@@ -533,19 +504,6 @@ static ssize_t fd_set_configfs_dev_param
 					" bytes\n", fd_dev->fd_dev_size);
 			fd_dev->fbd_flags |= FBDF_HAS_SIZE;
 			break;
-		case Opt_fd_buffered_io:
-			match_int(args, &arg);
-			if (arg != 1) {
-				pr_err("bogus fd_buffered_io=%d value\n", arg);
-				ret = -EINVAL;
-				goto out;
-			}
-
-			pr_debug("FILEIO: Using buffered I/O"
-				" operations for struct fd_dev\n");
-
-			fd_dev->fbd_flags |= FDBD_USE_BUFFERED_IO;
-			break;
 		default:
 			break;
 		}
@@ -577,10 +535,8 @@ static ssize_t fd_show_configfs_dev_para
 	ssize_t bl = 0;
 
 	bl = sprintf(b + bl, "TCM FILEIO ID: %u", fd_dev->fd_dev_id);
-	bl += sprintf(b + bl, "        File: %s  Size: %llu  Mode: %s\n",
-		fd_dev->fd_dev_name, fd_dev->fd_dev_size,
-		(fd_dev->fbd_flags & FDBD_USE_BUFFERED_IO) ?
-		"Buffered" : "Synchronous");
+	bl += sprintf(b + bl, "        File: %s  Size: %llu  Mode: O_DSYNC\n",
+		fd_dev->fd_dev_name, fd_dev->fd_dev_size);
 	return bl;
 }
 
--- a/drivers/target/target_core_file.h
+++ b/drivers/target/target_core_file.h
@@ -18,7 +18,6 @@ struct fd_request {
 
 #define FBDF_HAS_PATH		0x01
 #define FBDF_HAS_SIZE		0x02
-#define FDBD_USE_BUFFERED_IO	0x04
 
 struct fd_dev {
 	u32		fbd_flags;



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 3.4 18/24] target/file: Re-enable optional fd_buffered_io=1 operation
  2014-02-18 22:46 [PATCH 3.4 00/24] 3.4.81-stable review Greg Kroah-Hartman
                   ` (16 preceding siblings ...)
  2014-02-18 22:47 ` [PATCH 3.4 17/24] target/file: Use O_DSYNC by default for FILEIO backends Greg Kroah-Hartman
@ 2014-02-18 22:47 ` Greg Kroah-Hartman
  2014-02-18 22:47 ` [PATCH 3.4 19/24] KVM: Fix buffer overflow in kvm_set_irq() Greg Kroah-Hartman
                   ` (7 subsequent siblings)
  25 siblings, 0 replies; 28+ messages in thread
From: Greg Kroah-Hartman @ 2014-02-18 22:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Ferry, Christoph Hellwig,
	Nicholas Bellinger, Ben Hutchings, Li Zefan

3.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Nicholas Bellinger <nab@linux-iscsi.org>

commit b32f4c7ed85c5cee2a21a55c9f59ebc9d57a2463 upstream.

This patch re-adds the ability to optionally run in buffered FILEIO mode
(e.g. without O_DSYNC) for device backends, in order to once again use the
Linux buffered cache as a write-back storage mechanism.

This logic was originally dropped with mainline v3.5-rc commit:

commit a4dff3043c231d57f982af635c9d2192ee40e5ae
Author: Nicholas Bellinger <nab@linux-iscsi.org>
Date:   Wed May 30 16:25:41 2012 -0700

    target/file: Use O_DSYNC by default for FILEIO backends

The difference with this patch is that fd_create_virtdevice() now
forces the explicit setting of emulate_write_cache=1 when buffered FILEIO
operation has been enabled.
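
As a usage note, the token is passed through the device's configfs control
attribute, along the lines of (paths and sizes illustrative):

 # cd /sys/kernel/config/target/core/fileio_0/disk0
 # echo fd_dev_name=/var/target/disk0.img,fd_dev_size=1073741824,fd_buffered_io=1 > control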

(v2: Switch to FDBD_HAS_BUFFERED_IO_WCE + add more detailed
     comment as requested by hch)

Reported-by: Ferry <iscsitmp@bananateam.nl>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 drivers/target/target_core_file.c |   41 +++++++++++++++++++++++++++++++++++---
 drivers/target/target_core_file.h |    1 
 2 files changed, 39 insertions(+), 3 deletions(-)

--- a/drivers/target/target_core_file.c
+++ b/drivers/target/target_core_file.c
@@ -138,6 +138,19 @@ static struct se_device *fd_create_virtd
 	 * of pure timestamp updates.
 	 */
 	flags = O_RDWR | O_CREAT | O_LARGEFILE | O_DSYNC;
+	/*
+	 * Optionally allow fd_buffered_io=1 to be enabled for people
+	 * who want use the fs buffer cache as an WriteCache mechanism.
+	 *
+	 * This means that in event of a hard failure, there is a risk
+	 * of silent data-loss if the SCSI client has *not* performed a
+	 * forced unit access (FUA) write, or issued SYNCHRONIZE_CACHE
+	 * to write-out the entire device cache.
+	 */
+	if (fd_dev->fbd_flags & FDBD_HAS_BUFFERED_IO_WCE) {
+		pr_debug("FILEIO: Disabling O_DSYNC, using buffered FILEIO\n");
+		flags &= ~O_DSYNC;
+	}
 
 	file = filp_open(dev_p, flags, 0600);
 	if (IS_ERR(file)) {
@@ -205,6 +218,12 @@ static struct se_device *fd_create_virtd
 	if (!dev)
 		goto fail;
 
+	if (fd_dev->fbd_flags & FDBD_HAS_BUFFERED_IO_WCE) {
+		pr_debug("FILEIO: Forcing setting of emulate_write_cache=1"
+			" with FDBD_HAS_BUFFERED_IO_WCE\n");
+		dev->se_sub_dev->se_dev_attrib.emulate_write_cache = 1;
+	}
+
 	fd_dev->fd_dev_id = fd_host->fd_host_dev_id_count++;
 	fd_dev->fd_queue_depth = dev->queue_depth;
 
@@ -449,6 +468,7 @@ enum {
 static match_table_t tokens = {
 	{Opt_fd_dev_name, "fd_dev_name=%s"},
 	{Opt_fd_dev_size, "fd_dev_size=%s"},
+	{Opt_fd_buffered_io, "fd_buffered_io=%d"},
 	{Opt_err, NULL}
 };
 
@@ -460,7 +480,7 @@ static ssize_t fd_set_configfs_dev_param
 	struct fd_dev *fd_dev = se_dev->se_dev_su_ptr;
 	char *orig, *ptr, *arg_p, *opts;
 	substring_t args[MAX_OPT_ARGS];
-	int ret = 0, token;
+	int ret = 0, arg, token;
 
 	opts = kstrdup(page, GFP_KERNEL);
 	if (!opts)
@@ -504,6 +524,19 @@ static ssize_t fd_set_configfs_dev_param
 					" bytes\n", fd_dev->fd_dev_size);
 			fd_dev->fbd_flags |= FBDF_HAS_SIZE;
 			break;
+		case Opt_fd_buffered_io:
+			match_int(args, &arg);
+			if (arg != 1) {
+				pr_err("bogus fd_buffered_io=%d value\n", arg);
+				ret = -EINVAL;
+				goto out;
+			}
+
+			pr_debug("FILEIO: Using buffered I/O"
+				" operations for struct fd_dev\n");
+
+			fd_dev->fbd_flags |= FDBD_HAS_BUFFERED_IO_WCE;
+			break;
 		default:
 			break;
 		}
@@ -535,8 +568,10 @@ static ssize_t fd_show_configfs_dev_para
 	ssize_t bl = 0;
 
 	bl = sprintf(b + bl, "TCM FILEIO ID: %u", fd_dev->fd_dev_id);
-	bl += sprintf(b + bl, "        File: %s  Size: %llu  Mode: O_DSYNC\n",
-		fd_dev->fd_dev_name, fd_dev->fd_dev_size);
+	bl += sprintf(b + bl, "        File: %s  Size: %llu  Mode: %s\n",
+		fd_dev->fd_dev_name, fd_dev->fd_dev_size,
+		(fd_dev->fbd_flags & FDBD_HAS_BUFFERED_IO_WCE) ?
+		"Buffered-WCE" : "O_DSYNC");
 	return bl;
 }
 
--- a/drivers/target/target_core_file.h
+++ b/drivers/target/target_core_file.h
@@ -18,6 +18,7 @@ struct fd_request {
 
 #define FBDF_HAS_PATH		0x01
 #define FBDF_HAS_SIZE		0x02
+#define FDBD_HAS_BUFFERED_IO_WCE 0x04
 
 struct fd_dev {
 	u32		fbd_flags;



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 3.4 19/24] KVM: Fix buffer overflow in kvm_set_irq()
  2014-02-18 22:46 [PATCH 3.4 00/24] 3.4.81-stable review Greg Kroah-Hartman
                   ` (17 preceding siblings ...)
  2014-02-18 22:47 ` [PATCH 3.4 18/24] target/file: Re-enable optional fd_buffered_io=1 operation Greg Kroah-Hartman
@ 2014-02-18 22:47 ` Greg Kroah-Hartman
  2014-02-18 22:47 ` [PATCH 3.4 20/24] PM / Hibernate: Hibernate/thaw fixes/improvements Greg Kroah-Hartman
                   ` (6 subsequent siblings)
  25 siblings, 0 replies; 28+ messages in thread
From: Greg Kroah-Hartman @ 2014-02-18 22:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Avi Kivity, Ben Hutchings, Li Zefan

3.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Avi Kivity <avi@redhat.com>

commit f2ebd422f71cda9c791f76f85d2ca102ae34a1ed upstream.

kvm_set_irq() has an internal buffer of three irq routing entries, allowing
a GSI to be connected to up to three IRQ chips, or to one MSI.  However,
setup_routing_entry() does not properly enforce this, allowing three irqchip
routes followed by an MSI route to overflow the buffer.

Fix this by ensuring that an MSI entry is only ever added to an empty list.
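
For context, kvm_set_irq() gathers the routes for a GSI into a fixed
on-stack array, roughly (sketch; sizing as in x86 KVM of this era):

	#define KVM_NR_IRQCHIPS 3	/* e.g. PIC master, PIC slave, IOAPIC */

	struct kvm_kernel_irq_routing_entry irq_set[KVM_NR_IRQCHIPS];
	/* a GSI may map to at most one route per irqchip, or a single MSI;
	 * three irqchip routes plus an MSI route would write irq_set[3] */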

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 virt/kvm/irq_comm.c |    1 +
 1 file changed, 1 insertion(+)

--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -318,6 +318,7 @@ static int setup_routing_entry(struct kv
 	 */
 	hlist_for_each_entry(ei, n, &rt->map[ue->gsi], link)
 		if (ei->type == KVM_IRQ_ROUTING_MSI ||
+		    ue->type == KVM_IRQ_ROUTING_MSI ||
 		    ue->u.irqchip.irqchip == ei->irqchip.irqchip)
 			return r;
 



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 3.4 20/24] PM / Hibernate: Hibernate/thaw fixes/improvements
  2014-02-18 22:46 [PATCH 3.4 00/24] 3.4.81-stable review Greg Kroah-Hartman
                   ` (18 preceding siblings ...)
  2014-02-18 22:47 ` [PATCH 3.4 19/24] KVM: Fix buffer overflow in kvm_set_irq() Greg Kroah-Hartman
@ 2014-02-18 22:47 ` Greg Kroah-Hartman
  2014-02-18 22:47 ` [PATCH 3.4 21/24] Input: synaptics - handle out of bounds values from the hardware Greg Kroah-Hartman
                   ` (5 subsequent siblings)
  25 siblings, 0 replies; 28+ messages in thread
From: Greg Kroah-Hartman @ 2014-02-18 22:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Bojan Smojver, Rafael J. Wysocki,
	Ben Hutchings, Li Zefan

3.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Bojan Smojver <bojan@rexursive.com>

commit 5a21d489fd9541a4a66b9a500659abaca1b19a51 upstream.

 1. Do not allocate memory for buffers from emergency pools, unless
    absolutely required. Do not warn about and do not retry non-essential
    failed allocations (see the sketch after this list).

 2. Do not check the number of free pages left on every single page
    write, but wait until one map is completely populated and then check.

 3. Set maximum number of pages for read buffering consistently, instead
    of inadvertently depending on the size of the sector type.

 4. Fix copyright line, which I missed when I submitted the hibernation
    threading patch.

 5. Dispense with bit shifting arithmetic to improve readability.

 6. Really recalculate the number of pages required to be free after all
    allocations have been done.

 7. Fix the calculation of pages required for read buffering. Only count
    pages that do not belong to high memory.
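
A minimal sketch of what item 1 means in gfp terms (illustrative, mirrors
the hunks below):

	/* non-essential buffer: skip the emergency pools (no __GFP_HIGH),
	 * and neither warn nor retry on failure */
	src = (void *)__get_free_page(__GFP_WAIT | __GFP_NOWARN |
				      __GFP_NORETRY);
	if (!src) {
		/* wait for in-flight I/O to complete, freeing pages... */
		hib_wait_on_bio_chain(bio_chain);
		/* ...then try exactly once more */
		src = (void *)__get_free_page(__GFP_WAIT | __GFP_NOWARN |
					      __GFP_NORETRY);
	}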

Signed-off-by: Bojan Smojver <bojan@rexursive.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 kernel/power/swap.c |   62 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 39 insertions(+), 23 deletions(-)

--- a/kernel/power/swap.c
+++ b/kernel/power/swap.c
@@ -6,7 +6,7 @@
  *
  * Copyright (C) 1998,2001-2005 Pavel Machek <pavel@ucw.cz>
  * Copyright (C) 2006 Rafael J. Wysocki <rjw@sisk.pl>
- * Copyright (C) 2010 Bojan Smojver <bojan@rexursive.com>
+ * Copyright (C) 2010-2012 Bojan Smojver <bojan@rexursive.com>
  *
  * This file is released under the GPLv2.
  *
@@ -282,14 +282,17 @@ static int write_page(void *buf, sector_
 		return -ENOSPC;
 
 	if (bio_chain) {
-		src = (void *)__get_free_page(__GFP_WAIT | __GFP_HIGH);
+		src = (void *)__get_free_page(__GFP_WAIT | __GFP_NOWARN |
+		                              __GFP_NORETRY);
 		if (src) {
 			copy_page(src, buf);
 		} else {
 			ret = hib_wait_on_bio_chain(bio_chain); /* Free pages */
 			if (ret)
 				return ret;
-			src = (void *)__get_free_page(__GFP_WAIT | __GFP_HIGH);
+			src = (void *)__get_free_page(__GFP_WAIT |
+			                              __GFP_NOWARN |
+			                              __GFP_NORETRY);
 			if (src) {
 				copy_page(src, buf);
 			} else {
@@ -367,12 +370,17 @@ static int swap_write_page(struct swap_m
 		clear_page(handle->cur);
 		handle->cur_swap = offset;
 		handle->k = 0;
-	}
-	if (bio_chain && low_free_pages() <= handle->reqd_free_pages) {
-		error = hib_wait_on_bio_chain(bio_chain);
-		if (error)
-			goto out;
-		handle->reqd_free_pages = reqd_free_pages();
+
+		if (bio_chain && low_free_pages() <= handle->reqd_free_pages) {
+			error = hib_wait_on_bio_chain(bio_chain);
+			if (error)
+				goto out;
+			/*
+			 * Recalculate the number of required free pages, to
+			 * make sure we never take more than half.
+			 */
+			handle->reqd_free_pages = reqd_free_pages();
+		}
 	}
  out:
 	return error;
@@ -419,8 +427,9 @@ static int swap_writer_finish(struct swa
 /* Maximum number of threads for compression/decompression. */
 #define LZO_THREADS	3
 
-/* Maximum number of pages for read buffering. */
-#define LZO_READ_PAGES	(MAP_PAGE_ENTRIES * 8)
+/* Minimum/maximum number of pages for read buffering. */
+#define LZO_MIN_RD_PAGES	1024
+#define LZO_MAX_RD_PAGES	8192
 
 
 /**
@@ -631,12 +640,6 @@ static int save_image_lzo(struct swap_ma
 	}
 
 	/*
-	 * Adjust number of free pages after all allocations have been done.
-	 * We don't want to run out of pages when writing.
-	 */
-	handle->reqd_free_pages = reqd_free_pages();
-
-	/*
 	 * Start the CRC32 thread.
 	 */
 	init_waitqueue_head(&crc->go);
@@ -657,6 +660,12 @@ static int save_image_lzo(struct swap_ma
 		goto out_clean;
 	}
 
+	/*
+	 * Adjust the number of required free pages after all allocations have
+	 * been done. We don't want to run out of pages when writing.
+	 */
+	handle->reqd_free_pages = reqd_free_pages();
+
 	printk(KERN_INFO
 		"PM: Using %u thread(s) for compression.\n"
 		"PM: Compressing and saving image data (%u pages) ...     ",
@@ -1067,7 +1076,7 @@ static int load_image_lzo(struct swap_ma
 	unsigned i, thr, run_threads, nr_threads;
 	unsigned ring = 0, pg = 0, ring_size = 0,
 	         have = 0, want, need, asked = 0;
-	unsigned long read_pages;
+	unsigned long read_pages = 0;
 	unsigned char **page = NULL;
 	struct dec_data *data = NULL;
 	struct crc_data *crc = NULL;
@@ -1079,7 +1088,7 @@ static int load_image_lzo(struct swap_ma
 	nr_threads = num_online_cpus() - 1;
 	nr_threads = clamp_val(nr_threads, 1, LZO_THREADS);
 
-	page = vmalloc(sizeof(*page) * LZO_READ_PAGES);
+	page = vmalloc(sizeof(*page) * LZO_MAX_RD_PAGES);
 	if (!page) {
 		printk(KERN_ERR "PM: Failed to allocate LZO page\n");
 		ret = -ENOMEM;
@@ -1144,15 +1153,22 @@ static int load_image_lzo(struct swap_ma
 	}
 
 	/*
-	 * Adjust number of pages for read buffering, in case we are short.
+	 * Set the number of pages for read buffering.
+	 * This is complete guesswork, because we'll only know the real
+	 * picture once prepare_image() is called, which is much later on
+	 * during the image load phase. We'll assume the worst case and
+	 * say that none of the image pages are from high memory.
 	 */
-	read_pages = (nr_free_pages() - snapshot_get_image_size()) >> 1;
-	read_pages = clamp_val(read_pages, LZO_CMP_PAGES, LZO_READ_PAGES);
+	if (low_free_pages() > snapshot_get_image_size())
+		read_pages = (low_free_pages() - snapshot_get_image_size()) / 2;
+	read_pages = clamp_val(read_pages, LZO_MIN_RD_PAGES, LZO_MAX_RD_PAGES);
 
 	for (i = 0; i < read_pages; i++) {
 		page[i] = (void *)__get_free_page(i < LZO_CMP_PAGES ?
 		                                  __GFP_WAIT | __GFP_HIGH :
-		                                  __GFP_WAIT);
+		                                  __GFP_WAIT | __GFP_NOWARN |
+		                                  __GFP_NORETRY);
+
 		if (!page[i]) {
 			if (i < LZO_CMP_PAGES) {
 				ring_size = i;



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 3.4 21/24] Input: synaptics - handle out of bounds values from the hardware
  2014-02-18 22:46 [PATCH 3.4 00/24] 3.4.81-stable review Greg Kroah-Hartman
                   ` (19 preceding siblings ...)
  2014-02-18 22:47 ` [PATCH 3.4 20/24] PM / Hibernate: Hibernate/thaw fixes/improvements Greg Kroah-Hartman
@ 2014-02-18 22:47 ` Greg Kroah-Hartman
  2014-02-18 22:47   ` Greg Kroah-Hartman
                   ` (4 subsequent siblings)
  25 siblings, 0 replies; 28+ messages in thread
From: Greg Kroah-Hartman @ 2014-02-18 22:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Seth Forshee, Daniel Kurtz,
	Dmitry Torokhov, Ben Hutchings, Li Zefan

3.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Seth Forshee <seth.forshee@canonical.com>

commit c0394506e69b37c47d391c2a7bbea3ea236d8ec8 upstream.

The touchpad on the Acer Aspire One D250 will report out of range values
in the extreme lower portion of the touchpad. These appear as abrupt
changes in the values reported by the hardware from very low values to
very high values, which can cause unexpected vertical jumps in the
position of the mouse pointer.

What seems to be happening is that the value is wrapping to a two's
complement negative value of higher resolution than the 13-bit value
reported by the hardware, with the high-order bits being truncated. This
patch adds handling for these values by converting them to the
appropriate negative values.
The only tricky part about this is deciding when to treat a number as
negative. It stands to reason that if out of range values can be
reported on the low end then it could also happen on the high end, so
not all out of range values should be treated as negative. The approach
taken here is to split the difference between the maximum legitimate
value for the axis and the maximum possible value that the hardware can
report, treating values greater than this number as negative and all
other values as positive. This can be tweaked later if hardware is found
that operates outside of these parameters.
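
Plugging in the 13-bit reporting range and the axis maximum of 6143, the
split-the-difference threshold works out as follows (illustrative
arithmetic):

	/* X_MAX_POSITIVE = ((1 << 13) + 6143) / 2 = (8192 + 6143) / 2 = 7167 */
	if (hw->x > 7167)		/* e.g. the hardware reports 8100 */
		hw->x -= 1 << 13;	/* 8100 - 8192 = -92, a small negative */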

BugLink: http://bugs.launchpad.net/bugs/1001251
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Reviewed-by: Daniel Kurtz <djkurtz@chromium.org>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 drivers/input/mouse/synaptics.c |   23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

--- a/drivers/input/mouse/synaptics.c
+++ b/drivers/input/mouse/synaptics.c
@@ -40,11 +40,28 @@
  * Note that newer firmware allows querying device for maximum useable
  * coordinates.
  */
+#define XMIN 0
+#define XMAX 6143
+#define YMIN 0
+#define YMAX 6143
 #define XMIN_NOMINAL 1472
 #define XMAX_NOMINAL 5472
 #define YMIN_NOMINAL 1408
 #define YMAX_NOMINAL 4448
 
+/* Size in bits of absolute position values reported by the hardware */
+#define ABS_POS_BITS 13
+
+/*
+ * Any position values from the hardware above the following limits are
+ * treated as "wrapped around negative" values that have been truncated to
+ * the 13-bit reporting range of the hardware. These are just reasonable
+ * guesses and can be adjusted if hardware is found that operates outside
+ * of these parameters.
+ */
+#define X_MAX_POSITIVE (((1 << ABS_POS_BITS) + XMAX) / 2)
+#define Y_MAX_POSITIVE (((1 << ABS_POS_BITS) + YMAX) / 2)
+
 /*
  * Synaptics touchpads report the y coordinate from bottom to top, which is
  * opposite from what userspace expects.
@@ -555,6 +572,12 @@ static int synaptics_parse_hw_state(cons
 		hw->right = (buf[0] & 0x02) ? 1 : 0;
 	}
 
+	/* Convert wrap-around values to negative */
+	if (hw->x > X_MAX_POSITIVE)
+		hw->x -= 1 << ABS_POS_BITS;
+	if (hw->y > Y_MAX_POSITIVE)
+		hw->y -= 1 << ABS_POS_BITS;
+
 	return 0;
 }
 



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 3.4 22/24] virtio-blk: Use block layer provided spinlock
  2014-02-18 22:46 [PATCH 3.4 00/24] 3.4.81-stable review Greg Kroah-Hartman
@ 2014-02-18 22:47   ` Greg Kroah-Hartman
  2014-02-18 22:46 ` [PATCH 3.4 02/24] mm: __set_page_dirty_nobuffers() uses spin_lock_irqsave() instead of spin_lock_irq() Greg Kroah-Hartman
                     ` (24 subsequent siblings)
  25 siblings, 0 replies; 28+ messages in thread
From: Greg Kroah-Hartman @ 2014-02-18 22:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, virtualization, kvm, Asias He,
	Michael S. Tsirkin, Rusty Russell, Ben Hutchings, Li Zefan

3.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Asias He <asias@redhat.com>

commit 2c95a3290919541b846bee3e0fbaa75860929f53 upstream.

The block layer will allocate a spinlock for the queue if the driver does
not provide one in blk_init_queue().

The reason to use the internal spinlock is that blk_cleanup_queue() will
switch to the internal spinlock in the cleanup code path:

        if (q->queue_lock != &q->__queue_lock)
                q->queue_lock = &q->__queue_lock;

However, processes which are in D state might have taken the
driver-provided spinlock; when those processes wake up, they release the
block-layer-provided spinlock.

=====================================
[ BUG: bad unlock balance detected! ]
3.4.0-rc7+ #238 Not tainted
-------------------------------------
fio/3587 is trying to release lock (&(&q->__queue_lock)->rlock) at:
[<ffffffff813274d2>] blk_queue_bio+0x2a2/0x380
but there are no more locks to release!

other info that might help us debug this:
1 lock held by fio/3587:
 #0:  (&(&vblk->lock)->rlock){......}, at:
[<ffffffff8132661a>] get_request_wait+0x19a/0x250

Other drivers use the block-layer-provided spinlock as well, e.g. SCSI.

Switching to the block-layer-provided spinlock saves a bit of memory and
does not increase lock contention. Performance testing shows no real
difference before and after this patch.
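
Roughly, the unlock imbalance unfolds like this (illustrative timeline):

	/*
	 * 1. q->queue_lock == &vblk->lock; a task takes q->queue_lock and
	 *    blocks in get_request_wait().
	 * 2. blk_cleanup_queue() runs: q->queue_lock = &q->__queue_lock.
	 * 3. The task wakes and unlocks q->queue_lock -- now the block
	 *    layer's internal lock, not the driver lock it acquired.
	 */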

Changes in v2: Improve commit log as Michael suggested.

Cc: virtualization@lists.linux-foundation.org
Cc: kvm@vger.kernel.org
Signed-off-by: Asias He <asias@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 drivers/block/virtio_blk.c |    9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -21,8 +21,6 @@ struct workqueue_struct *virtblk_wq;
 
 struct virtio_blk
 {
-	spinlock_t lock;
-
 	struct virtio_device *vdev;
 	struct virtqueue *vq;
 
@@ -69,7 +67,7 @@ static void blk_done(struct virtqueue *v
 	unsigned int len;
 	unsigned long flags;
 
-	spin_lock_irqsave(&vblk->lock, flags);
+	spin_lock_irqsave(vblk->disk->queue->queue_lock, flags);
 	while ((vbr = virtqueue_get_buf(vblk->vq, &len)) != NULL) {
 		int error;
 
@@ -104,7 +102,7 @@ static void blk_done(struct virtqueue *v
 	}
 	/* In case queue is stopped waiting for more buffers. */
 	blk_start_queue(vblk->disk->queue);
-	spin_unlock_irqrestore(&vblk->lock, flags);
+	spin_unlock_irqrestore(vblk->disk->queue->queue_lock, flags);
 }
 
 static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
@@ -438,7 +436,6 @@ static int __devinit virtblk_probe(struc
 	}
 
 	INIT_LIST_HEAD(&vblk->reqs);
-	spin_lock_init(&vblk->lock);
 	vblk->vdev = vdev;
 	vblk->sg_elems = sg_elems;
 	sg_init_table(vblk->sg, vblk->sg_elems);
@@ -463,7 +460,7 @@ static int __devinit virtblk_probe(struc
 		goto out_mempool;
 	}
 
-	q = vblk->disk->queue = blk_init_queue(do_virtblk_request, &vblk->lock);
+	q = vblk->disk->queue = blk_init_queue(do_virtblk_request, NULL);
 	if (!q) {
 		err = -ENOMEM;
 		goto out_put_disk;



^ permalink raw reply	[flat|nested] 28+ messages in thread


* [PATCH 3.4 23/24] lib/vsprintf.c: kptr_restrict: fix pK-error in SysRq show-all-timers(Q)
  2014-02-18 22:46 [PATCH 3.4 00/24] 3.4.81-stable review Greg Kroah-Hartman
                   ` (21 preceding siblings ...)
  2014-02-18 22:47   ` Greg Kroah-Hartman
@ 2014-02-18 22:47 ` Greg Kroah-Hartman
  2014-02-18 22:47 ` [PATCH 3.4 24/24] nfs: tear down caches in nfs_init_writepagecache when allocation fails Greg Kroah-Hartman
                   ` (2 subsequent siblings)
  25 siblings, 0 replies; 28+ messages in thread
From: Greg Kroah-Hartman @ 2014-02-18 22:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Stevie Trujillo, Dan Rosenberg,
	Andrew Morton, Linus Torvalds, Ben Hutchings, Li Zefan

3.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Dan Rosenberg <drosenberg@vsecurity.com>

commit 3715c5309f6d175c3053672b73fd4f73be16fd07 upstream.

When using ALT+SysRq+Q, all the pointers are replaced with "pK-error",
like this:

	[23153.208033]   .base:               pK-error

while with echo h > /proc/sysrq-trigger it works:

	[23107.776363]   .base:       ffff88023e60d540

The intent behind this behavior was to return "pK-error" in cases where
the %pK format specifier was used in interrupt context, because the
CAP_SYSLOG check wouldn't be meaningful.  Clearly this should only apply
when kptr_restrict is actually enabled, though.
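
In practice, with this fix (illustrative):

 # echo 1 > /proc/sys/kernel/kptr_restrict   # %pK still yields pK-error in IRQ context
 # echo 0 > /proc/sys/kernel/kptr_restrict   # SysRq-Q now prints real .base pointers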

Reported-by: Stevie Trujillo <stevie.trujillo@gmail.com>
Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 lib/vsprintf.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -926,7 +926,8 @@ char *pointer(const char *fmt, char *buf
 		 * %pK cannot be used in IRQ context because its test
 		 * for CAP_SYSLOG would be meaningless.
 		 */
-		if (in_irq() || in_serving_softirq() || in_nmi()) {
+		if (kptr_restrict && (in_irq() || in_serving_softirq() ||
+				      in_nmi())) {
 			if (spec.field_width == -1)
 				spec.field_width = 2 * sizeof(void *);
 			return string(buf, end, "pK-error", spec);



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 3.4 24/24] nfs: tear down caches in nfs_init_writepagecache when allocation fails
  2014-02-18 22:46 [PATCH 3.4 00/24] 3.4.81-stable review Greg Kroah-Hartman
                   ` (22 preceding siblings ...)
  2014-02-18 22:47 ` [PATCH 3.4 23/24] lib/vsprintf.c: kptr_restrict: fix pK-error in SysRq show-all-timers(Q) Greg Kroah-Hartman
@ 2014-02-18 22:47 ` Greg Kroah-Hartman
  2014-02-19  4:26 ` [PATCH 3.4 00/24] 3.4.81-stable review Guenter Roeck
  2014-02-20  0:03 ` Shuah Khan
  25 siblings, 0 replies; 28+ messages in thread
From: Greg Kroah-Hartman @ 2014-02-18 22:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Bryan Schumaker, Jeff Layton,
	Trond Myklebust, Ben Hutchings, Li Zefan

3.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Jeff Layton <jlayton@redhat.com>

commit 3dd4765fce04c0b4af1e0bc4c0b10f906f95fabc upstream.

...and ensure that we tear down the nfs_commit_data cache too when
unloading the module.
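
The fix uses the usual kernel error-unwind idiom; schematically (all names
below are hypothetical, mirroring the hunk's structure):

	#include <linux/slab.h>
	#include <linux/mempool.h>

	static struct kmem_cache *cache;
	static mempool_t *pool_a, *pool_b;

	static int __init init_caches(void)
	{
		cache = kmem_cache_create("demo", 256, 0, 0, NULL);
		if (cache == NULL)
			return -ENOMEM;

		pool_a = mempool_create_slab_pool(16, cache);
		if (pool_a == NULL)
			goto out_destroy_cache;

		pool_b = mempool_create_slab_pool(16, cache);
		if (pool_b == NULL)
			goto out_destroy_pool_a;

		return 0;

	out_destroy_pool_a:
		mempool_destroy(pool_a);	/* unwind in reverse order */
	out_destroy_cache:
		kmem_cache_destroy(cache);
		return -ENOMEM;
	}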

Cc: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
[bwh: Backported to 3.2: drop the nfs_cdata_cachep cleanup; it doesn't exist]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 fs/nfs/write.c |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1751,12 +1751,12 @@ int __init nfs_init_writepagecache(void)
 	nfs_wdata_mempool = mempool_create_slab_pool(MIN_POOL_WRITE,
 						     nfs_wdata_cachep);
 	if (nfs_wdata_mempool == NULL)
-		return -ENOMEM;
+		goto out_destroy_write_cache;
 
 	nfs_commit_mempool = mempool_create_slab_pool(MIN_POOL_COMMIT,
 						      nfs_wdata_cachep);
 	if (nfs_commit_mempool == NULL)
-		return -ENOMEM;
+		goto out_destroy_write_mempool;
 
 	/*
 	 * NFS congestion size, scale with available memory.
@@ -1779,6 +1779,12 @@ int __init nfs_init_writepagecache(void)
 		nfs_congestion_kb = 256*1024;
 
 	return 0;
+
+out_destroy_write_mempool:
+	mempool_destroy(nfs_wdata_mempool);
+out_destroy_write_cache:
+	kmem_cache_destroy(nfs_wdata_cachep);
+	return -ENOMEM;
 }
 
 void nfs_destroy_writepagecache(void)



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3.4 00/24] 3.4.81-stable review
  2014-02-18 22:46 [PATCH 3.4 00/24] 3.4.81-stable review Greg Kroah-Hartman
                   ` (23 preceding siblings ...)
  2014-02-18 22:47 ` [PATCH 3.4 24/24] nfs: tear down caches in nfs_init_writepagecache when allocation fails Greg Kroah-Hartman
@ 2014-02-19  4:26 ` Guenter Roeck
  2014-02-20  0:03 ` Shuah Khan
  25 siblings, 0 replies; 28+ messages in thread
From: Guenter Roeck @ 2014-02-19  4:26 UTC (permalink / raw)
  To: Greg Kroah-Hartman, linux-kernel; +Cc: torvalds, akpm, stable, lizefan

On 02/18/2014 02:46 PM, Greg Kroah-Hartman wrote:
> Many thanks to Li Zefan for digging up a bunch of these patches, that
> work is much appreciated.
>
> This is the start of the stable review cycle for the 3.4.81 release.
> There are 24 patches in this series, all will be posted as a response
> to this one.  If anyone has any issues with these being applied, please
> let me know.
>

Build results:
	total: 119 pass: 97 skipped: 18 fail: 4

qemu tests all passed.

Details are available at http://server.roeck-us.net:8010/builders.

Results are as expected.

Guenter


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3.4 00/24] 3.4.81-stable review
  2014-02-18 22:46 [PATCH 3.4 00/24] 3.4.81-stable review Greg Kroah-Hartman
                   ` (24 preceding siblings ...)
  2014-02-19  4:26 ` [PATCH 3.4 00/24] 3.4.81-stable review Guenter Roeck
@ 2014-02-20  0:03 ` Shuah Khan
  25 siblings, 0 replies; 28+ messages in thread
From: Shuah Khan @ 2014-02-20  0:03 UTC (permalink / raw)
  To: Greg Kroah-Hartman, linux-kernel
  Cc: torvalds, akpm, stable, lizefan, Shuah Khan, shuahkhan

On 02/18/2014 03:46 PM, Greg Kroah-Hartman wrote:
> Many thanks to Li Zefan for digging up a bunch of these patches, that
> work is much appreciated.
>
> This is the start of the stable review cycle for the 3.4.81 release.
> There are 24 patches in this series, all will be posted as a response
> to this one.  If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Thu Feb 20 22:45:38 UTC 2014.
> Anything received after that time might be too late.
>
> The whole patch series can be found in one patch at:
> 	kernel.org/pub/linux/kernel/v3.0/stable-review/patch-3.4.81-rc1.gz
> and the diffstat can be found below.
>
> thanks,
>
> greg k-h
>

Compile tests and boot tests passed. No dmesg regressions: emerg, crit, 
alert, err are clean. No regressions in warn.

-- Shuah


-- 
Shuah Khan
Senior Linux Kernel Developer - Open Source Group
Samsung Research America(Silicon Valley)
shuah.kh@samsung.com | (970) 672-0658

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2014-02-20  0:03 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-18 22:46 [PATCH 3.4 00/24] 3.4.81-stable review Greg Kroah-Hartman
2014-02-18 22:46 ` [PATCH 3.4 01/24] SELinux: Fix kernel BUG on empty security contexts Greg Kroah-Hartman
2014-02-18 22:46 ` [PATCH 3.4 02/24] mm: __set_page_dirty_nobuffers() uses spin_lock_irqsave() instead of spin_lock_irq() Greg Kroah-Hartman
2014-02-18 22:46 ` [PATCH 3.4 03/24] mm: __set_page_dirty uses spin_lock_irqsave instead of spin_lock_irq Greg Kroah-Hartman
2014-02-18 22:46 ` [PATCH 3.4 04/24] x86, hweight: Fix BUG when booting with CONFIG_GCOV_PROFILE_ALL=y Greg Kroah-Hartman
2014-02-18 22:46 ` [PATCH 3.4 05/24] printk: Fix scheduling-while-atomic problem in console_cpu_notify() Greg Kroah-Hartman
2014-02-18 22:46 ` [PATCH 3.4 06/24] ext4: protect group inode free counting with group lock Greg Kroah-Hartman
2014-02-18 22:46 ` [PATCH 3.4 07/24] drm/i915: kick any firmware framebuffers before claiming the gtt Greg Kroah-Hartman
2014-02-18 22:46 ` [PATCH 3.4 08/24] mm/page_alloc.c: remove pageblock_default_order() Greg Kroah-Hartman
2014-02-18 22:46 ` [PATCH 3.4 09/24] mm: setup pageblock_order before its used by sparsemem Greg Kroah-Hartman
2014-02-18 22:46 ` [PATCH 3.4 10/24] dm sysfs: fix a module unload race Greg Kroah-Hartman
2014-02-18 22:46 ` [PATCH 3.4 11/24] ftrace: Synchronize setting function_trace_op with ftrace_trace_function Greg Kroah-Hartman
2014-02-18 22:46 ` [PATCH 3.4 12/24] ftrace: Fix synchronization location disabling and freeing ftrace_ops Greg Kroah-Hartman
2014-02-18 22:47 ` [PATCH 3.4 13/24] ftrace: Have function graph only trace based on global_ops filters Greg Kroah-Hartman
2014-02-18 22:47 ` [PATCH 3.4 14/24] sched/nohz: Fix rq->cpu_load[] calculations Greg Kroah-Hartman
2014-02-18 22:47 ` [PATCH 3.4 15/24] sched/nohz: Fix rq->cpu_load calculations some more Greg Kroah-Hartman
2014-02-18 22:47 ` [PATCH 3.4 16/24] IB/qib: Convert qib_user_sdma_pin_pages() to use get_user_pages_fast() Greg Kroah-Hartman
2014-02-18 22:47 ` [PATCH 3.4 17/24] target/file: Use O_DSYNC by default for FILEIO backends Greg Kroah-Hartman
2014-02-18 22:47 ` [PATCH 3.4 18/24] target/file: Re-enable optional fd_buffered_io=1 operation Greg Kroah-Hartman
2014-02-18 22:47 ` [PATCH 3.4 19/24] KVM: Fix buffer overflow in kvm_set_irq() Greg Kroah-Hartman
2014-02-18 22:47 ` [PATCH 3.4 20/24] PM / Hibernate: Hibernate/thaw fixes/improvements Greg Kroah-Hartman
2014-02-18 22:47 ` [PATCH 3.4 21/24] Input: synaptics - handle out of bounds values from the hardware Greg Kroah-Hartman
2014-02-18 22:47 ` [PATCH 3.4 22/24] virtio-blk: Use block layer provided spinlock Greg Kroah-Hartman
2014-02-18 22:47 ` [PATCH 3.4 23/24] lib/vsprintf.c: kptr_restrict: fix pK-error in SysRq show-all-timers(Q) Greg Kroah-Hartman
2014-02-18 22:47 ` [PATCH 3.4 24/24] nfs: tear down caches in nfs_init_writepagecache when allocation fails Greg Kroah-Hartman
2014-02-19  4:26 ` [PATCH 3.4 00/24] 3.4.81-stable review Guenter Roeck
2014-02-20  0:03 ` Shuah Khan
