From: Jiri Slaby <jslaby@suse.cz>
To: stable@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, Gu Zheng <guz.fnst@cn.fujitsu.com>,
	Benjamin LaHaise <bcrl@kvack.org>, Jiri Slaby <jslaby@suse.cz>
Subject: [PATCH 3.12 44/66] aio: fix uncorrent dirty pages accouting when truncating AIO ring buffer
Date: Sat,  6 Dec 2014 16:07:36 +0100
Message-ID: <cab86d8c6534900243994bb0558e66a41fcbc39d.1417878427.git.jslaby@suse.cz>
In-Reply-To: <d278ba6471641f99eda3b3c76f8414339c9dbed0.1417878427.git.jslaby@suse.cz>
In-Reply-To: <cover.1417878427.git.jslaby@suse.cz>

From: Gu Zheng <guz.fnst@cn.fujitsu.com>

3.12-stable review patch.  If anyone has any objections, please let me know.

===============

commit 835f252c6debd204fcd607c79975089b1ecd3472 upstream.

https://bugzilla.kernel.org/show_bug.cgi?id=86831

Markus reported that shutting down mysqld (with AIO support, on an
ext3-formatted hard drive) leads to a negative number of dirty pages (the
counter underruns). The negative number drastically reduces write
performance: the page cache is not used because the kernel thinks there
are still 2^32 dirty pages outstanding.
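
As a rough illustration of the arithmetic (a userspace sketch, not kernel
code; toy_nr_dirty and both helpers are hypothetical names, and a 32-bit
counter is assumed): decrementing a zero counter gives -1, which reads as
2^32 - 1 when the same bits are interpreted as unsigned.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 32-bit dirty page counter (not vm_stat). */
static int32_t toy_nr_dirty;

/* Decrement with no matching increment, as in the reported underrun. */
static void toy_dec_dirty(void)
{
	toy_nr_dirty--;
}

/* The same bits viewed as an unsigned quantity. */
static uint32_t toy_dirty_unsigned(void)
{
	return (uint32_t)toy_nr_dirty;
}
```

Starting from 0, one call to toy_dec_dirty() leaves toy_dirty_unsigned() at
4294967295, i.e. 2^32 - 1.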

Adding a WARN trace to __dec_zone_state() catches this easily:

static inline void __dec_zone_state(struct zone *zone,
				    enum zone_stat_item item)
{
     atomic_long_dec(&zone->vm_stat[item]);
+    WARN_ON_ONCE(item == NR_FILE_DIRTY &&
+                 atomic_long_read(&zone->vm_stat[item]) < 0);
     atomic_long_dec(&vm_stat[item]);
}

[   21.341632] ------------[ cut here ]------------
[   21.346294] WARNING: CPU: 0 PID: 309 at include/linux/vmstat.h:242 cancel_dirty_page+0x164/0x224()
[   21.355296] Modules linked in: wutbox_cp sata_mv
[   21.359968] CPU: 0 PID: 309 Comm: kworker/0:1 Not tainted 3.14.21-WuT #80
[   21.366793] Workqueue: events free_ioctx
[   21.370760] [<c0016a64>] (unwind_backtrace) from [<c0012f88>] (show_stack+0x20/0x24)
[   21.378562] [<c0012f88>] (show_stack) from [<c03f8ccc>] (dump_stack+0x24/0x28)
[   21.385840] [<c03f8ccc>] (dump_stack) from [<c0023ae4>] (warn_slowpath_common+0x84/0x9c)
[   21.393976] [<c0023ae4>] (warn_slowpath_common) from [<c0023bb8>] (warn_slowpath_null+0x2c/0x34)
[   21.402800] [<c0023bb8>] (warn_slowpath_null) from [<c00c0688>] (cancel_dirty_page+0x164/0x224)
[   21.411524] [<c00c0688>] (cancel_dirty_page) from [<c00c080c>] (truncate_inode_page+0x8c/0x158)
[   21.420272] [<c00c080c>] (truncate_inode_page) from [<c00c0a94>] (truncate_inode_pages_range+0x11c/0x53c)
[   21.429890] [<c00c0a94>] (truncate_inode_pages_range) from [<c00c0f6c>] (truncate_pagecache+0x88/0xac)
[   21.439252] [<c00c0f6c>] (truncate_pagecache) from [<c00c0fec>] (truncate_setsize+0x5c/0x74)
[   21.447731] [<c00c0fec>] (truncate_setsize) from [<c013b3a8>] (put_aio_ring_file.isra.14+0x34/0x90)
[   21.456826] [<c013b3a8>] (put_aio_ring_file.isra.14) from [<c013b424>] (aio_free_ring+0x20/0xcc)
[   21.465660] [<c013b424>] (aio_free_ring) from [<c013b4f4>] (free_ioctx+0x24/0x44)
[   21.473190] [<c013b4f4>] (free_ioctx) from [<c003d8d8>] (process_one_work+0x134/0x47c)
[   21.481132] [<c003d8d8>] (process_one_work) from [<c003e988>] (worker_thread+0x130/0x414)
[   21.489350] [<c003e988>] (worker_thread) from [<c00448ac>] (kthread+0xd4/0xec)
[   21.496621] [<c00448ac>] (kthread) from [<c000ec18>] (ret_from_fork+0x14/0x20)
[   21.503884] ---[ end trace 79c4bf42c038c9a1 ]---

The cause is that we mark the aio ring file pages *DIRTY* via SetPageDirty
at init time, which bypasses the VFS dirty pages increment, while the aio fs
uses *default_backing_dev_info* as its backing dev, which does not disable
the dirty pages accounting capability. Truncating the aio ring file
therefore performs the VFS dirty pages decrement without a matching
increment, and the counter underruns.
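
The imbalance can be sketched with a toy userspace model (every name below
is hypothetical and only mimics the shape of the kernel paths; it is not
the kernel implementation). The accounted setter bumps the counter, the raw
setter flips only the flag, and the truncate path decrements for any dirty
page it finds.

```c
#include <stdbool.h>

/* Toy userspace model; every name here is hypothetical, not a kernel API. */
struct toy_page {
	bool dirty;
};

static long toy_nr_file_dirty;	/* stands in for the NR_FILE_DIRTY counter */

/* Accounted path: set the flag and increment the counter together. */
static void toy_set_page_dirty(struct toy_page *page)
{
	if (!page->dirty) {
		page->dirty = true;
		toy_nr_file_dirty++;
	}
}

/* Raw path: flip the flag only, as a bare SetPageDirty does. */
static void toy_raw_set_dirty(struct toy_page *page)
{
	page->dirty = true;
}

/* Truncate path: a dirty page is cleaned and the counter decremented. */
static void toy_cancel_dirty_page(struct toy_page *page)
{
	if (page->dirty) {
		page->dirty = false;
		toy_nr_file_dirty--;
	}
}
```

A page dirtied through toy_set_page_dirty() and then cancelled leaves the
counter at 0; a page dirtied through toy_raw_set_dirty() and then cancelled
leaves it at -1.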

The original goal was to keep these pages in memory (so they cannot be
reclaimed or swapped out) for their whole lifetime by marking them dirty.
On reflection, we already pin the pages by elevating their refcount, which
achieves the same goal, so the SetPageDirty seems unnecessary.

To fix the issue, use __set_page_dirty_no_writeback instead of the nop
.set_page_dirty, and drop the SetPageDirty (don't manually set the dirty
flags, don't disable set_page_dirty(), rely on default behaviour).

With the above change, dirty pages accounting works correctly. But as we
know, the aio fs is an anonymous one that should never cause any real
write-back, so we can ignore dirty pages (write-back) accounting by
disabling the corresponding capabilities. So we introduce an aio private
backing dev info (with the ACCT_DIRTY/WRITEBACK/ACCT_WB capabilities
disabled) to replace the default one.
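
How a capability mask can switch accounting off is sketched below, as a
userspace model under stated assumptions: the TOY_CAP_* values, struct
toy_bdi and both helpers are hypothetical and are not the kernel's BDI_CAP_*
definitions. A backing dev with the combined "no accounting, no writeback"
mask never touches the counter, so an unbalanced decrement cannot happen.

```c
#include <stdbool.h>

/* Hypothetical flag values for illustration; not the kernel's BDI_CAP_* bits. */
#define TOY_CAP_NO_ACCT_DIRTY	0x1u
#define TOY_CAP_NO_WRITEBACK	0x2u
#define TOY_CAP_NO_ACCT_WB	0x4u
#define TOY_CAP_NO_ACCT_AND_WRITEBACK \
	(TOY_CAP_NO_ACCT_DIRTY | TOY_CAP_NO_WRITEBACK | TOY_CAP_NO_ACCT_WB)

struct toy_bdi {
	unsigned int capabilities;
	long nr_dirty;
};

/* Dirty accounting is on unless the NO_ACCT_DIRTY bit is set. */
static bool toy_cap_account_dirty(const struct toy_bdi *bdi)
{
	return !(bdi->capabilities & TOY_CAP_NO_ACCT_DIRTY);
}

/* Cleaning a dirty page touches the counter only when accounting is on. */
static void toy_account_page_cleaned(struct toy_bdi *bdi)
{
	if (toy_cap_account_dirty(bdi))
		bdi->nr_dirty--;
}
```

With capabilities set to 0, toy_account_page_cleaned() decrements nr_dirty;
with TOY_CAP_NO_ACCT_AND_WRITEBACK it leaves nr_dirty untouched.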

Reported-by: Markus Königshaus <m.koenigshaus@wut.de>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 fs/aio.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 66bd7e4447ad..307d7708dc00 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -164,6 +164,15 @@ static struct vfsmount *aio_mnt;
 static const struct file_operations aio_ring_fops;
 static const struct address_space_operations aio_ctx_aops;
 
+/* Backing dev info for aio fs.
+ * -no dirty page accounting or writeback happens
+ */
+static struct backing_dev_info aio_fs_backing_dev_info = {
+	.name           = "aiofs",
+	.state          = 0,
+	.capabilities   = BDI_CAP_NO_ACCT_AND_WRITEBACK | BDI_CAP_MAP_COPY,
+};
+
 static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
 {
 	struct qstr this = QSTR_INIT("[aio]", 5);
@@ -175,6 +184,7 @@ static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
 
 	inode->i_mapping->a_ops = &aio_ctx_aops;
 	inode->i_mapping->private_data = ctx;
+	inode->i_mapping->backing_dev_info = &aio_fs_backing_dev_info;
 	inode->i_size = PAGE_SIZE * nr_pages;
 
 	path.dentry = d_alloc_pseudo(aio_mnt->mnt_sb, &this);
@@ -220,6 +230,9 @@ static int __init aio_setup(void)
 	if (IS_ERR(aio_mnt))
 		panic("Failed to create aio fs mount.");
 
+	if (bdi_init(&aio_fs_backing_dev_info))
+		panic("Failed to init aio fs backing dev info.");
+
 	kiocb_cachep = KMEM_CACHE(kiocb, SLAB_HWCACHE_ALIGN|SLAB_PANIC);
 	kioctx_cachep = KMEM_CACHE(kioctx,SLAB_HWCACHE_ALIGN|SLAB_PANIC);
 
@@ -281,11 +294,6 @@ static const struct file_operations aio_ring_fops = {
 	.mmap = aio_ring_mmap,
 };
 
-static int aio_set_page_dirty(struct page *page)
-{
-	return 0;
-}
-
 #if IS_ENABLED(CONFIG_MIGRATION)
 static int aio_migratepage(struct address_space *mapping, struct page *new,
 			struct page *old, enum migrate_mode mode)
@@ -357,7 +365,7 @@ out:
 #endif
 
 static const struct address_space_operations aio_ctx_aops = {
-	.set_page_dirty = aio_set_page_dirty,
+	.set_page_dirty = __set_page_dirty_no_writeback,
 #if IS_ENABLED(CONFIG_MIGRATION)
 	.migratepage	= aio_migratepage,
 #endif
@@ -412,7 +420,6 @@ static int aio_setup_ring(struct kioctx *ctx)
 		pr_debug("pid(%d) page[%d]->count=%d\n",
 			 current->pid, i, page_count(page));
 		SetPageUptodate(page);
-		SetPageDirty(page);
 		unlock_page(page);
 
 		ctx->ring_pages[i] = page;
-- 
2.1.3

