* [PATCH 0/5] xfs: quota deadlock fixes
@ 2017-02-15 15:40 Brian Foster
  2017-02-15 15:40 ` [PATCH 1/5] xfs: bypass dquot reclaim to avoid quotacheck deadlock Brian Foster
                   ` (6 more replies)
  0 siblings, 7 replies; 23+ messages in thread
From: Brian Foster @ 2017-02-15 15:40 UTC (permalink / raw)
  To: linux-xfs; +Cc: Eryu Guan, Dave Chinner

Hi all,

This is a collection of several quota related deadlock fixes for
problems that have been reported to the list recently.

Patch 1 fixes the low memory quotacheck problem reported by Martin[1].
Dave is CC'd as he had comments on this particular thread that started a
discussion, but I hadn't heard anything back since my last response.

Patch 2 fixes a separate problem I ran into while attempting to
reproduce Eryu's xfs/305 hang report[2]. 

Patches 3-5 fix the actual problem reported by Eryu, which is a quotaoff
deadlock reproduced by xfs/305.

Further details are included in the individual commit log descriptions.
Thoughts, reviews, flames appreciated.

Eryu,

I've run several hundred iterations of this on your reproducer system
without reproducing the hang. I have reproduced a reset overnight but
still haven't been able to grab a stack trace from that occurrence (I'll
try again today/tonight with better console logging). I suspect this is
a separate problem (possibly just an ASSERT() failure as this is a DEBUG
kernel), but I'd appreciate any testing you can run against these
patches to verify.

Brian

[1] http://www.spinics.net/lists/linux-xfs/msg01901.html
[2] http://www.spinics.net/lists/linux-xfs/msg03835.html

Brian Foster (5):
  xfs: bypass dquot reclaim to avoid quotacheck deadlock
  xfs: allocate quotaoff transactions up front to avoid log deadlock
  xfs: support ability to wait on new inodes
  xfs: update ag iterator to support wait on new inodes
  xfs: wait on new inodes during quotaoff dquot release

 fs/xfs/xfs_icache.c      | 58 ++++++++++++++++++++++++++++++++++++++++--------
 fs/xfs/xfs_icache.h      |  8 +++++++
 fs/xfs/xfs_inode.h       |  4 +++-
 fs/xfs/xfs_qm.c          | 11 +++++++++
 fs/xfs/xfs_qm.h          |  1 +
 fs/xfs/xfs_qm_syscalls.c | 45 +++++++++++++++++++++----------------
 6 files changed, 98 insertions(+), 29 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/5] xfs: bypass dquot reclaim to avoid quotacheck deadlock
  2017-02-15 15:40 [PATCH 0/5] xfs: quota deadlock fixes Brian Foster
@ 2017-02-15 15:40 ` Brian Foster
  2017-02-16 22:37   ` Dave Chinner
  2017-02-15 15:40 ` [PATCH 2/5] xfs: allocate quotaoff transactions up front to avoid log deadlock Brian Foster
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Brian Foster @ 2017-02-15 15:40 UTC (permalink / raw)
  To: linux-xfs; +Cc: Eryu Guan, Dave Chinner

Reclaim during quotacheck can lead to deadlocks on the dquot flush lock:

 - Quotacheck populates a local delwri queue with the physical dquot
   buffers.
 - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and dirties
   all of the dquots.
 - Reclaim kicks in and attempts to flush a dquot whose buffer is
   already queued on the quotacheck queue. The flush succeeds but
   queueing to the reclaim delwri queue fails as the backing buffer is
   already queued. The flush unlock is now deferred to I/O completion of
   the buffer from the quotacheck queue.
 - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires the
   flush lock to update the backing buffers with the in-core
   recalculated values. This deadlocks as the flush lock was acquired by
   reclaim but the buffer was never submitted for I/O as it already resided
   on the quotacheck queue.

This is reproduced by running quotacheck on a filesystem with a couple
million inodes in low memory (512MB-1GB) situations.

Quotacheck first resets and collects the physical dquot buffers in a
delwri queue. Then, it traverses the filesystem inodes via bulkstat,
updates the in-core dquots, flushes the corrected dquots to the backing
buffers and finally submits the delwri queue for I/O. Since the backing
buffers are queued across the entire quotacheck operation, dquot reclaim
cannot possibly complete a dquot flush before quotacheck completes.
Therefore, dquot reclaim provides no real value during quotacheck.

Lock out dquot reclaim during quotacheck to avoid the deadlock. Define
a new quotainfo flag that is set while quotacheck is in progress and
that reclaim can use to bypass processing as appropriate.
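
Condensed, the mechanism amounts to the following (a sketch distilled
from the hunks below, not a drop-in; initialization and error paths
are trimmed):

    /* xfs_qm.h: new quotainfo field, false except during quotacheck */
    bool        qi_quotacheck;  /* quotacheck is running */

    /* xfs_qm_shrink_scan(): reclaim can't complete a flush anyway */
    if (qi->qi_quotacheck)
            return 0;

    /* xfs_qm_quotacheck(): bracket the whole operation with the flag */
    mp->m_quotainfo->qi_quotacheck = true;
    /* ... reset dquot buffers, bulkstat, xfs_qm_flush_one() walk ... */
    mp->m_quotainfo->qi_quotacheck = false; /* on the common exit path */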

Reported-by: Martin Svec <martin.svec@zoner.cz>
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_qm.c | 11 +++++++++++
 fs/xfs/xfs_qm.h |  1 +
 2 files changed, 12 insertions(+)

diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index b669b12..7c44a6e 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -527,6 +527,8 @@ xfs_qm_shrink_scan(
 
 	if ((sc->gfp_mask & (__GFP_FS|__GFP_DIRECT_RECLAIM)) != (__GFP_FS|__GFP_DIRECT_RECLAIM))
 		return 0;
+	if (qi->qi_quotacheck)
+		return 0;
 
 	INIT_LIST_HEAD(&isol.buffers);
 	INIT_LIST_HEAD(&isol.dispose);
@@ -623,6 +625,7 @@ xfs_qm_init_quotainfo(
 	INIT_RADIX_TREE(&qinf->qi_gquota_tree, GFP_NOFS);
 	INIT_RADIX_TREE(&qinf->qi_pquota_tree, GFP_NOFS);
 	mutex_init(&qinf->qi_tree_lock);
+	qinf->qi_quotacheck = false;
 
 	/* mutex used to serialize quotaoffs */
 	mutex_init(&qinf->qi_quotaofflock);
@@ -1294,6 +1297,12 @@ xfs_qm_quotacheck(
 	ASSERT(uip || gip || pip);
 	ASSERT(XFS_IS_QUOTA_RUNNING(mp));
 
+	/*
+	 * Set a flag to lock out dquot reclaim during quotacheck. The dquot
+	 * shrinker can cause flush lock deadlocks by attempting to flush dquots
+	 * whose backing buffers are already on the quotacheck delwri queue.
+	 */
+	mp->m_quotainfo->qi_quotacheck = true;
 	xfs_notice(mp, "Quotacheck needed: Please wait.");
 
 	/*
@@ -1384,6 +1393,8 @@ xfs_qm_quotacheck(
 	mp->m_qflags |= flags;
 
  error_return:
+	mp->m_quotainfo->qi_quotacheck = false;
+
 	while (!list_empty(&buffer_list)) {
 		struct xfs_buf *bp =
 			list_first_entry(&buffer_list, struct xfs_buf, b_list);
diff --git a/fs/xfs/xfs_qm.h b/fs/xfs/xfs_qm.h
index 2975a82..d5a443d 100644
--- a/fs/xfs/xfs_qm.h
+++ b/fs/xfs/xfs_qm.h
@@ -89,6 +89,7 @@ typedef struct xfs_quotainfo {
 	struct xfs_def_quota	qi_grp_default;
 	struct xfs_def_quota	qi_prj_default;
 	struct shrinker  qi_shrinker;
+	bool			qi_quotacheck;	/* quotacheck is running */
 } xfs_quotainfo_t;
 
 static inline struct radix_tree_root *
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 2/5] xfs: allocate quotaoff transactions up front to avoid log deadlock
  2017-02-15 15:40 [PATCH 0/5] xfs: quota deadlock fixes Brian Foster
  2017-02-15 15:40 ` [PATCH 1/5] xfs: bypass dquot reclaim to avoid quotacheck deadlock Brian Foster
@ 2017-02-15 15:40 ` Brian Foster
  2017-04-26 21:23   ` Darrick J. Wong
  2017-02-15 15:40 ` [PATCH 3/5] xfs: support ability to wait on new inodes Brian Foster
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Brian Foster @ 2017-02-15 15:40 UTC (permalink / raw)
  To: linux-xfs; +Cc: Eryu Guan, Dave Chinner

The quotaoff operation commits two explicitly synchronous
transactions to correctly handle log recovery of dquots being
modified at the time the quotaoff occurs. The first quotaoff
transaction pins the tail of the log with a qoff logitem in the AIL
to ensure further logged dquots stay ahead of the quotaoff operation
in the log. The second quotaoff_end transaction is committed after
the quotaoff operation completes; it releases the original log item
and thus unpins the log.

Since the first transaction pins the tail of the log, only a finite
amount of log space is available between the time a quotaoff starts
and the time it completes before transaction reservations have to
block on the log. While the quotaoff operation itself consumes no
further log space, it is possible for other operations in the
filesystem to consume the remaining log space before the quotaoff
completes. If this occurs, the quotaoff_end transaction also blocks
on the log, which prevents the release of the log item, and the
filesystem deadlocks. This has been reproduced via repeated xfs/305
iterations on a vm with fairly limited resources.

To avoid a deadlock due to a particularly slow quotaoff operation,
allocate the quotaoff_end transaction immediately after the initial
quotaoff transaction is committed. Carry a reference to the
transaction through xfs_qm_scall_quotaoff() rather than the qoff log
item and commit it once the quotaoff completes.
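
In outline, the reordered sequence becomes (condensed from the patch
below; error handling elided):

    /*
     * 1. Quotaoff start: log the qoff item and commit synchronously.
     *    This pins the log tail until the matching end item is logged.
     */
    xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
    /* ... get and log the qoff start item ... */
    xfs_trans_set_sync(tp);
    xfs_trans_commit(tp);

    /*
     * 2. Immediately allocate and populate the quotaoff_end transaction
     *    so its log reservation is taken before other transactions can
     *    consume the remaining log space. Do not commit it yet.
     */
    xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_equotaoff, 0, 0, 0, &qend_tp);
    /* ... get and log the qoff end item referencing the start item ... */
    xfs_trans_set_sync(qend_tp);

    /* 3. ... the actual quotaoff work (dqrele scan, dqpurge, etc.) ... */

    /*
     * 4. Only now commit the pre-allocated end transaction, releasing
     *    the start item and unpinning the log tail.
     */
    xfs_trans_commit(qend_tp);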

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_qm_syscalls.c | 42 ++++++++++++++++++++++++------------------
 1 file changed, 24 insertions(+), 18 deletions(-)

diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
index 475a388..dbb6802 100644
--- a/fs/xfs/xfs_qm_syscalls.c
+++ b/fs/xfs/xfs_qm_syscalls.c
@@ -35,9 +35,11 @@
 #include "xfs_trace.h"
 #include "xfs_icache.h"
 
-STATIC int	xfs_qm_log_quotaoff(xfs_mount_t *, xfs_qoff_logitem_t **, uint);
-STATIC int	xfs_qm_log_quotaoff_end(xfs_mount_t *, xfs_qoff_logitem_t *,
-					uint);
+STATIC int	xfs_qm_log_quotaoff(struct xfs_mount *, struct xfs_trans **,
+				    uint);
+STATIC int	xfs_qm_log_quotaoff_end(struct xfs_mount *,
+					struct xfs_qoff_logitem *,
+					struct xfs_trans **, uint);
 
 /*
  * Turn off quota accounting and/or enforcement for all udquots and/or
@@ -56,7 +58,7 @@ xfs_qm_scall_quotaoff(
 	uint			dqtype;
 	int			error;
 	uint			inactivate_flags;
-	xfs_qoff_logitem_t	*qoffstart;
+	struct xfs_trans	*qend_tp;
 
 	/*
 	 * No file system can have quotas enabled on disk but not in core.
@@ -128,7 +130,7 @@ xfs_qm_scall_quotaoff(
 	 * and synchronously. If we fail to write, we should abort the
 	 * operation as it cannot be recovered safely if we crash.
 	 */
-	error = xfs_qm_log_quotaoff(mp, &qoffstart, flags);
+	error = xfs_qm_log_quotaoff(mp, &qend_tp, flags);
 	if (error)
 		goto out_unlock;
 
@@ -181,7 +183,7 @@ xfs_qm_scall_quotaoff(
 	 * So, we have QUOTAOFF start and end logitems; the start
 	 * logitem won't get overwritten until the end logitem appears...
 	 */
-	error = xfs_qm_log_quotaoff_end(mp, qoffstart, flags);
+	error = xfs_trans_commit(qend_tp);
 	if (error) {
 		/* We're screwed now. Shutdown is the only option. */
 		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
@@ -556,13 +558,14 @@ xfs_qm_scall_setqlim(
 
 STATIC int
 xfs_qm_log_quotaoff_end(
-	xfs_mount_t		*mp,
-	xfs_qoff_logitem_t	*startqoff,
+	struct xfs_mount	*mp,
+	struct xfs_qoff_logitem	*startqoff,
+	struct xfs_trans	**tpp,
 	uint			flags)
 {
-	xfs_trans_t		*tp;
+	struct xfs_trans	*tp;
 	int			error;
-	xfs_qoff_logitem_t	*qoffi;
+	struct xfs_qoff_logitem	*qoffi;
 
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_equotaoff, 0, 0, 0, &tp);
 	if (error)
@@ -578,21 +581,22 @@ xfs_qm_log_quotaoff_end(
 	 * We don't care about quotoff's performance.
 	 */
 	xfs_trans_set_sync(tp);
-	return xfs_trans_commit(tp);
+	*tpp = tp;
+	return 0;
 }
 
 
 STATIC int
 xfs_qm_log_quotaoff(
-	xfs_mount_t	       *mp,
-	xfs_qoff_logitem_t     **qoffstartp,
-	uint		       flags)
+	struct xfs_mount	*mp,
+	struct xfs_trans	**end_tp,
+	uint			flags)
 {
-	xfs_trans_t	       *tp;
+	struct xfs_trans	*tp;
 	int			error;
-	xfs_qoff_logitem_t     *qoffi;
+	struct xfs_qoff_logitem	*qoffi;
 
-	*qoffstartp = NULL;
+	*end_tp = NULL;
 
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
 	if (error)
@@ -617,7 +621,9 @@ xfs_qm_log_quotaoff(
 	if (error)
 		goto out;
 
-	*qoffstartp = qoffi;
+	error = xfs_qm_log_quotaoff_end(mp, qoffi, end_tp, flags);
+	if (error)
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
 out:
 	return error;
 }
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 3/5] xfs: support ability to wait on new inodes
  2017-02-15 15:40 [PATCH 0/5] xfs: quota deadlock fixes Brian Foster
  2017-02-15 15:40 ` [PATCH 1/5] xfs: bypass dquot reclaim to avoid quotacheck deadlock Brian Foster
  2017-02-15 15:40 ` [PATCH 2/5] xfs: allocate quotaoff transactions up front to avoid log deadlock Brian Foster
@ 2017-02-15 15:40 ` Brian Foster
  2017-04-27 21:15   ` Darrick J. Wong
  2017-02-15 15:40 ` [PATCH 4/5] xfs: update ag iterator to support " Brian Foster
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Brian Foster @ 2017-02-15 15:40 UTC (permalink / raw)
  To: linux-xfs; +Cc: Eryu Guan, Dave Chinner

Inodes that are inserted into the perag tree but still under
construction are flagged with the XFS_INEW bit. Most contexts either
skip such inodes when they are encountered or have the ability to
handle them.

The runtime quotaoff sequence introduces a context that must wait
for construction of such inodes to correctly ensure that all dquots
in the fs are released. In anticipation of this, support the ability
to wait on new inodes. Wake the associated bit waitqueue when XFS_INEW
is cleared.
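
Both halves of the handshake end up looking roughly like this (the
wake side is this patch; the wait loop is added by the next patch in
the series):

    /* waiter: sleep until XFS_INEW is cleared */
    wait_queue_head_t *wq = bit_waitqueue(&ip->i_flags, __XFS_INEW_BIT);
    DEFINE_WAIT_BIT(wait, &ip->i_flags, __XFS_INEW_BIT);

    do {
            prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
            if (!xfs_iflags_test(ip, XFS_INEW))
                    break;
            schedule();
    } while (true);
    finish_wait(wq, &wait.wait);

    /* waker: every path that clears XFS_INEW kicks the bit waitqueue */
    xfs_iflags_clear(ip, XFS_INEW);
    wake_up_bit(&ip->i_flags, __XFS_INEW_BIT);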

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_icache.c | 5 ++++-
 fs/xfs/xfs_inode.h  | 4 +++-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 7234b97..bb55fd7 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -366,14 +366,17 @@ xfs_iget_cache_hit(
 
 		error = xfs_reinit_inode(mp, inode);
 		if (error) {
+			bool wake;
 			/*
 			 * Re-initializing the inode failed, and we are in deep
 			 * trouble.  Try to re-add it to the reclaim list.
 			 */
 			rcu_read_lock();
 			spin_lock(&ip->i_flags_lock);
-
+			wake = !!__xfs_iflags_test(ip, XFS_INEW);
 			ip->i_flags &= ~(XFS_INEW | XFS_IRECLAIM);
+			if (wake)
+				wake_up_bit(&ip->i_flags, __XFS_INEW_BIT);
 			ASSERT(ip->i_flags & XFS_IRECLAIMABLE);
 			trace_xfs_iget_reclaim_fail(ip);
 			goto out_error;
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 10dcf27..10e89fc 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -216,7 +216,8 @@ static inline bool xfs_is_reflink_inode(struct xfs_inode *ip)
 #define XFS_IRECLAIM		(1 << 0) /* started reclaiming this inode */
 #define XFS_ISTALE		(1 << 1) /* inode has been staled */
 #define XFS_IRECLAIMABLE	(1 << 2) /* inode can be reclaimed */
-#define XFS_INEW		(1 << 3) /* inode has just been allocated */
+#define __XFS_INEW_BIT		3	 /* inode has just been allocated */
+#define XFS_INEW		(1 << __XFS_INEW_BIT)
 #define XFS_ITRUNCATED		(1 << 5) /* truncated down so flush-on-close */
 #define XFS_IDIRTY_RELEASE	(1 << 6) /* dirty release already seen */
 #define __XFS_IFLOCK_BIT	7	 /* inode is being flushed right now */
@@ -464,6 +465,7 @@ static inline void xfs_finish_inode_setup(struct xfs_inode *ip)
 	xfs_iflags_clear(ip, XFS_INEW);
 	barrier();
 	unlock_new_inode(VFS_I(ip));
+	wake_up_bit(&ip->i_flags, __XFS_INEW_BIT);
 }
 
 static inline void xfs_setup_existing_inode(struct xfs_inode *ip)
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 4/5] xfs: update ag iterator to support wait on new inodes
  2017-02-15 15:40 [PATCH 0/5] xfs: quota deadlock fixes Brian Foster
                   ` (2 preceding siblings ...)
  2017-02-15 15:40 ` [PATCH 3/5] xfs: support ability to wait on new inodes Brian Foster
@ 2017-02-15 15:40 ` Brian Foster
  2017-04-27 21:17   ` Darrick J. Wong
  2017-02-15 15:40 ` [PATCH 5/5] xfs: wait on new inodes during quotaoff dquot release Brian Foster
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Brian Foster @ 2017-02-15 15:40 UTC (permalink / raw)
  To: linux-xfs; +Cc: Eryu Guan, Dave Chinner

The AG inode iterator currently skips new inodes as such inodes are
inserted into the inode radix tree before they are fully
constructed. Certain contexts require the ability to wait on the
construction of new inodes, however. The fs-wide dquot release from
the quotaoff sequence is an example of this.

Update the AG inode iterator to support the ability to wait on
inodes flagged with XFS_INEW upon request. Create a new
xfs_inode_ag_iterator_flags() interface and support a set of
iteration flags to modify the iteration behavior. When the
XFS_AGITER_INEW_WAIT flag is set, include inodes flagged XFS_INEW in
the radix tree lookup and wait for the flag to clear before the
callback is executed.
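
Existing callers are unchanged; a caller that needs the new semantics
opts in explicitly, e.g. (this is the form the last patch in the
series uses):

    /*
     * Walk every inode in the filesystem, waiting on in-construction
     * (XFS_INEW) inodes rather than silently skipping them.
     */
    xfs_inode_ag_iterator_flags(mp, execute, flags, args,
                                XFS_AGITER_INEW_WAIT);

    /* the old interface becomes a thin wrapper passing no iter flags */
    xfs_inode_ag_iterator(mp, execute, flags, args);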

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_icache.c | 53 +++++++++++++++++++++++++++++++++++++++++++++--------
 fs/xfs/xfs_icache.h |  8 ++++++++
 2 files changed, 53 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index bb55fd7..cb45a70 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -262,6 +262,22 @@ xfs_inode_clear_reclaim_tag(
 	xfs_perag_clear_reclaim_tag(pag);
 }
 
+static void
+xfs_inew_wait(
+	struct xfs_inode	*ip)
+{
+	wait_queue_head_t *wq = bit_waitqueue(&ip->i_flags, __XFS_INEW_BIT);
+	DEFINE_WAIT_BIT(wait, &ip->i_flags, __XFS_INEW_BIT);
+
+	do {
+		prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
+		if (!xfs_iflags_test(ip, XFS_INEW))
+			break;
+		schedule();
+	} while (true);
+	finish_wait(wq, &wait.wait);
+}
+
 /*
  * When we recycle a reclaimable inode, we need to re-initialise the VFS inode
  * part of the structure. This is made more complex by the fact we store
@@ -626,9 +642,11 @@ xfs_iget(
 
 STATIC int
 xfs_inode_ag_walk_grab(
-	struct xfs_inode	*ip)
+	struct xfs_inode	*ip,
+	int			flags)
 {
 	struct inode		*inode = VFS_I(ip);
+	bool			newinos = !!(flags & XFS_AGITER_INEW_WAIT);
 
 	ASSERT(rcu_read_lock_held());
 
@@ -646,7 +664,8 @@ xfs_inode_ag_walk_grab(
 		goto out_unlock_noent;
 
 	/* avoid new or reclaimable inodes. Leave for reclaim code to flush */
-	if (__xfs_iflags_test(ip, XFS_INEW | XFS_IRECLAIMABLE | XFS_IRECLAIM))
+	if ((!newinos && __xfs_iflags_test(ip, XFS_INEW)) ||
+	    __xfs_iflags_test(ip, XFS_IRECLAIMABLE | XFS_IRECLAIM))
 		goto out_unlock_noent;
 	spin_unlock(&ip->i_flags_lock);
 
@@ -674,7 +693,8 @@ xfs_inode_ag_walk(
 					   void *args),
 	int			flags,
 	void			*args,
-	int			tag)
+	int			tag,
+	int			iter_flags)
 {
 	uint32_t		first_index;
 	int			last_error = 0;
@@ -716,7 +736,7 @@ xfs_inode_ag_walk(
 		for (i = 0; i < nr_found; i++) {
 			struct xfs_inode *ip = batch[i];
 
-			if (done || xfs_inode_ag_walk_grab(ip))
+			if (done || xfs_inode_ag_walk_grab(ip, iter_flags))
 				batch[i] = NULL;
 
 			/*
@@ -744,6 +764,9 @@ xfs_inode_ag_walk(
 		for (i = 0; i < nr_found; i++) {
 			if (!batch[i])
 				continue;
+			if ((iter_flags & XFS_AGITER_INEW_WAIT) &&
+			    xfs_iflags_test(batch[i], XFS_INEW))
+				xfs_inew_wait(batch[i]);
 			error = execute(batch[i], flags, args);
 			IRELE(batch[i]);
 			if (error == -EAGAIN) {
@@ -823,12 +846,13 @@ xfs_cowblocks_worker(
 }
 
 int
-xfs_inode_ag_iterator(
+xfs_inode_ag_iterator_flags(
 	struct xfs_mount	*mp,
 	int			(*execute)(struct xfs_inode *ip, int flags,
 					   void *args),
 	int			flags,
-	void			*args)
+	void			*args,
+	int			iter_flags)
 {
 	struct xfs_perag	*pag;
 	int			error = 0;
@@ -838,7 +862,8 @@ xfs_inode_ag_iterator(
 	ag = 0;
 	while ((pag = xfs_perag_get(mp, ag))) {
 		ag = pag->pag_agno + 1;
-		error = xfs_inode_ag_walk(mp, pag, execute, flags, args, -1);
+		error = xfs_inode_ag_walk(mp, pag, execute, flags, args, -1,
+					  iter_flags);
 		xfs_perag_put(pag);
 		if (error) {
 			last_error = error;
@@ -850,6 +875,17 @@ xfs_inode_ag_iterator(
 }
 
 int
+xfs_inode_ag_iterator(
+	struct xfs_mount	*mp,
+	int			(*execute)(struct xfs_inode *ip, int flags,
+					   void *args),
+	int			flags,
+	void			*args)
+{
+	return xfs_inode_ag_iterator_flags(mp, execute, flags, args, 0);
+}
+
+int
 xfs_inode_ag_iterator_tag(
 	struct xfs_mount	*mp,
 	int			(*execute)(struct xfs_inode *ip, int flags,
@@ -866,7 +902,8 @@ xfs_inode_ag_iterator_tag(
 	ag = 0;
 	while ((pag = xfs_perag_get_tag(mp, ag, tag))) {
 		ag = pag->pag_agno + 1;
-		error = xfs_inode_ag_walk(mp, pag, execute, flags, args, tag);
+		error = xfs_inode_ag_walk(mp, pag, execute, flags, args, tag,
+					  0);
 		xfs_perag_put(pag);
 		if (error) {
 			last_error = error;
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 8a7c849..9183f77 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -48,6 +48,11 @@ struct xfs_eofblocks {
 #define XFS_IGET_UNTRUSTED	0x2
 #define XFS_IGET_DONTCACHE	0x4
 
+/*
+ * flags for AG inode iterator
+ */
+#define XFS_AGITER_INEW_WAIT	0x1	/* wait on new inodes */
+
 int xfs_iget(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t ino,
 	     uint flags, uint lock_flags, xfs_inode_t **ipp);
 
@@ -79,6 +84,9 @@ void xfs_cowblocks_worker(struct work_struct *);
 int xfs_inode_ag_iterator(struct xfs_mount *mp,
 	int (*execute)(struct xfs_inode *ip, int flags, void *args),
 	int flags, void *args);
+int xfs_inode_ag_iterator_flags(struct xfs_mount *mp,
+	int (*execute)(struct xfs_inode *ip, int flags, void *args),
+	int flags, void *args, int iter_flags);
 int xfs_inode_ag_iterator_tag(struct xfs_mount *mp,
 	int (*execute)(struct xfs_inode *ip, int flags, void *args),
 	int flags, void *args, int tag);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 5/5] xfs: wait on new inodes during quotaoff dquot release
  2017-02-15 15:40 [PATCH 0/5] xfs: quota deadlock fixes Brian Foster
                   ` (3 preceding siblings ...)
  2017-02-15 15:40 ` [PATCH 4/5] xfs: update ag iterator to support " Brian Foster
@ 2017-02-15 15:40 ` Brian Foster
  2017-04-27 21:17   ` Darrick J. Wong
  2017-02-16  7:42 ` [PATCH 0/5] xfs: quota deadlock fixes Eryu Guan
  2017-02-17  6:53 ` Eryu Guan
  6 siblings, 1 reply; 23+ messages in thread
From: Brian Foster @ 2017-02-15 15:40 UTC (permalink / raw)
  To: linux-xfs; +Cc: Eryu Guan, Dave Chinner

The quotaoff operation has a race with inode allocation that results
in a livelock. An inode allocation that occurs before the quota
status flags are updated acquires the appropriate dquots for the
inode via xfs_qm_vop_dqalloc(). It then inserts the XFS_INEW inode
into the perag radix tree, sometime later attaches the dquots to the
inode and finally clears the XFS_INEW flag. Quotaoff expects to
release the dquots from all inodes in the filesystem via
xfs_qm_dqrele_all_inodes(). This invokes the AG inode iterator,
which skips inodes in the XFS_INEW state because they are not fully
constructed. If the scan occurs after dquots have been attached to
an inode, but before XFS_INEW is cleared, the newly allocated inode
will continue to hold a reference to the applicable dquots. When
quotaoff invokes xfs_qm_dqpurge_all(), the reference count of those
dquot(s) remains elevated and the dqpurge scan spins indefinitely.

To address this problem, update the xfs_qm_dqrele_all_inodes() scan
to wait on inodes marked with the XFS_INEW flag. We wait on the
inodes explicitly rather than skipping and retrying, to avoid continuous
retry loops due to a parallel inode allocation workload. Since
quotaoff updates the quota state flags and uses a synchronous
transaction before the dqrele scan, and dquots are attached to
inodes after radix tree insertion iff quota is enabled, one INEW
waiting pass through the AG guarantees that the scan has processed
all inodes that could possibly hold dquot references.

Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_qm_syscalls.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
index dbb6802..40b7c3f 100644
--- a/fs/xfs/xfs_qm_syscalls.c
+++ b/fs/xfs/xfs_qm_syscalls.c
@@ -765,5 +765,6 @@ xfs_qm_dqrele_all_inodes(
 	uint		 flags)
 {
 	ASSERT(mp->m_quotainfo);
-	xfs_inode_ag_iterator(mp, xfs_dqrele_inode, flags, NULL);
+	xfs_inode_ag_iterator_flags(mp, xfs_dqrele_inode, flags, NULL,
+				    XFS_AGITER_INEW_WAIT);
 }
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/5] xfs: quota deadlock fixes
  2017-02-15 15:40 [PATCH 0/5] xfs: quota deadlock fixes Brian Foster
                   ` (4 preceding siblings ...)
  2017-02-15 15:40 ` [PATCH 5/5] xfs: wait on new inodes during quotaoff dquot release Brian Foster
@ 2017-02-16  7:42 ` Eryu Guan
  2017-02-16 12:01   ` Brian Foster
  2017-02-17  6:53 ` Eryu Guan
  6 siblings, 1 reply; 23+ messages in thread
From: Eryu Guan @ 2017-02-16  7:42 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, Dave Chinner

On Wed, Feb 15, 2017 at 10:40:42AM -0500, Brian Foster wrote:
> Hi all,
> 
> This is a collection of several quota related deadlock fixes for
> problems that have been reported to the list recently.
> 
> Patch 1 fixes the low memory quotacheck problem reported by Martin[1].
> Dave is CC'd as he had comments on this particular thread that started a
> discussion, but I hadn't heard anything back since my last response.
> 
> Patch 2 fixes a separate problem I ran into while attempting to
> reproduce Eryu's xfs/305 hang report[2]. 
> 
> Patches 3-5 fix the actual problem reported by Eryu, which is a quotaoff
> deadlock reproduced by xfs/305.
> 
> Further details are included in the individual commit log descriptions.
> Thoughts, reviews, flames appreciated.
> 
> Eryu,
> 
> I've run several hundred iterations of this on your reproducer system
> without reproducing the hang. I have reproduced a reset overnight but
> still haven't been able to grab a stack trace from that occurrence (I'll
> try again today/tonight with better console logging). I suspect this is
> a separate problem (possibly just an ASSERT() failure as this is a DEBUG
> kernel), but I'd appreciate any testing you can run against these
> patches to verify.

Sure, I'll give them a test and report back once I get results. Thanks
for looking into it!

Eryu

> 
> Brian
> 
> [1] http://www.spinics.net/lists/linux-xfs/msg01901.html
> [2] http://www.spinics.net/lists/linux-xfs/msg03835.html
> 
> Brian Foster (5):
>   xfs: bypass dquot reclaim to avoid quotacheck deadlock
>   xfs: allocate quotaoff transactions up front to avoid log deadlock
>   xfs: support ability to wait on new inodes
>   xfs: update ag iterator to support wait on new inodes
>   xfs: wait on new inodes during quotaoff dquot release
> 
>  fs/xfs/xfs_icache.c      | 58 ++++++++++++++++++++++++++++++++++++++++--------
>  fs/xfs/xfs_icache.h      |  8 +++++++
>  fs/xfs/xfs_inode.h       |  4 +++-
>  fs/xfs/xfs_qm.c          | 11 +++++++++
>  fs/xfs/xfs_qm.h          |  1 +
>  fs/xfs/xfs_qm_syscalls.c | 45 +++++++++++++++++++++----------------
>  6 files changed, 98 insertions(+), 29 deletions(-)
> 
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/5] xfs: quota deadlock fixes
  2017-02-16  7:42 ` [PATCH 0/5] xfs: quota deadlock fixes Eryu Guan
@ 2017-02-16 12:01   ` Brian Foster
  0 siblings, 0 replies; 23+ messages in thread
From: Brian Foster @ 2017-02-16 12:01 UTC (permalink / raw)
  To: Eryu Guan; +Cc: linux-xfs, Dave Chinner

On Thu, Feb 16, 2017 at 03:42:34PM +0800, Eryu Guan wrote:
> On Wed, Feb 15, 2017 at 10:40:42AM -0500, Brian Foster wrote:
> > Hi all,
> > 
> > This is a collection of several quota related deadlock fixes for
> > problems that have been reported to the list recently.
> > 
> > Patch 1 fixes the low memory quotacheck problem reported by Martin[1].
> > Dave is CC'd as he had comments on this particular thread that started a
> > discussion, but I hadn't heard anything back since my last response.
> > 
> > Patch 2 fixes a separate problem I ran into while attempting to
> > reproduce Eryu's xfs/305 hang report[2]. 
> > 
> > Patches 3-5 fix the actual problem reported by Eryu, which is a quotaoff
> > deadlock reproduced by xfs/305.
> > 
> > Further details are included in the individual commit log descriptions.
> > Thoughts, reviews, flames appreciated.
> > 
> > Eryu,
> > 
> > I've run several hundred iterations of this on your reproducer system
> > without reproducing the hang. I have reproduced a reset overnight but
> > still haven't been able to grab a stack trace from that occurrence (I'll
> > try again today/tonight with better console logging). I suspect this is
> > a separate problem (possibly just an ASSERT() failure as this is a DEBUG
> > kernel), but I'd appreciate any testing you can run against these
> > patches to verify.
> 
> Sure, I'll give them a test and report back once I get results. Thanks
> for looking into it!
> 

Thanks. The test I was running ran overnight again (over 1k iterations)
without any problems. I'm logged off your server now.

Brian

> Eryu
> 
> > 
> > Brian
> > 
> > [1] http://www.spinics.net/lists/linux-xfs/msg01901.html
> > [2] http://www.spinics.net/lists/linux-xfs/msg03835.html
> > 
> > Brian Foster (5):
> >   xfs: bypass dquot reclaim to avoid quotacheck deadlock
> >   xfs: allocate quotaoff transactions up front to avoid log deadlock
> >   xfs: support ability to wait on new inodes
> >   xfs: update ag iterator to support wait on new inodes
> >   xfs: wait on new inodes during quotaoff dquot release
> > 
> >  fs/xfs/xfs_icache.c      | 58 ++++++++++++++++++++++++++++++++++++++++--------
> >  fs/xfs/xfs_icache.h      |  8 +++++++
> >  fs/xfs/xfs_inode.h       |  4 +++-
> >  fs/xfs/xfs_qm.c          | 11 +++++++++
> >  fs/xfs/xfs_qm.h          |  1 +
> >  fs/xfs/xfs_qm_syscalls.c | 45 +++++++++++++++++++++----------------
> >  6 files changed, 98 insertions(+), 29 deletions(-)
> > 
> > -- 
> > 2.7.4
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] xfs: bypass dquot reclaim to avoid quotacheck deadlock
  2017-02-15 15:40 ` [PATCH 1/5] xfs: bypass dquot reclaim to avoid quotacheck deadlock Brian Foster
@ 2017-02-16 22:37   ` Dave Chinner
  2017-02-17 18:30     ` Brian Foster
  0 siblings, 1 reply; 23+ messages in thread
From: Dave Chinner @ 2017-02-16 22:37 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, Eryu Guan

On Wed, Feb 15, 2017 at 10:40:43AM -0500, Brian Foster wrote:
> Reclaim during quotacheck can lead to deadlocks on the dquot flush lock:
> 
>  - Quotacheck populates a local delwri queue with the physical dquot
>    buffers.
>  - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and dirties
>    all of the dquots.
>  - Reclaim kicks in and attempts to flush a dquot whose buffer is
>    already queud on the quotacheck queue. The flush succeeds but
>    queueing to the reclaim delwri queue fails as the backing buffer is
>    already queued. The flush unlock is now deferred to I/O completion of
>    the buffer from the quotacheck queue.
>  - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires the
>    flush lock to update the backing buffers with the in-core
>    recalculated values. This deadlocks as the flush lock was acquired by
>    reclaim but the buffer never submitted for I/O as it already resided
>    on the quotacheck queue.
> 
> This is reproduced by running quotacheck on a filesystem with a couple
> million inodes in low memory (512MB-1GB) situations.
> 
> Quotacheck first resets and collects the physical dquot buffers in a
> delwri queue. Then, it traverses the filesystem inodes via bulkstat,
> updates the in-core dquots, flushes the corrected dquots to the backing
> buffers and finally submits the delwri queue for I/O. Since the backing
> buffers are queued across the entire quotacheck operation, dquot reclaim
> cannot possibly complete a dquot flush before quotacheck completes.
> Therefore, dquot reclaim provides no real value during quotacheck.

Which is an OOM vector on systems with lots of dquots and low memory
at mount time. Turning off dquot reclaim doesn't change that.

Address the root cause: the buffer list is never flushed, so it pins
the memory quotacheck requires; hence we need to flush the buffer
list more often. We also need to submit the buffer list before the
flush walk begins, thereby unlocking all the dquots before doing the
flush walk and avoiding the deadlock.
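
That is, something along these lines between the bulkstat pass and
the flush walk - just a sketch, reusing the buffer_list/error_return
names from the quotacheck code and assuming the existing
xfs_buf_delwri_submit() helper is usable at that point; how
xfs_qm_flush_one() then gets the buffers back onto a delwri queue is
left open:

    /* after the xfs_qm_dqusage_adjust() bulkstat pass completes ... */

    /*
     * Push the dquot buffers out now. Any flush locks held by reclaim
     * are released at I/O completion instead of being held until
     * quotacheck finishes.
     */
    error = xfs_buf_delwri_submit(&buffer_list);
    if (error)
            goto error_return;

    /* ... then run the xfs_qm_flush_one() walk on unlocked dquots */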

-Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/5] xfs: quota deadlock fixes
  2017-02-15 15:40 [PATCH 0/5] xfs: quota deadlock fixes Brian Foster
                   ` (5 preceding siblings ...)
  2017-02-16  7:42 ` [PATCH 0/5] xfs: quota deadlock fixes Eryu Guan
@ 2017-02-17  6:53 ` Eryu Guan
  2017-02-17 17:54   ` Brian Foster
  6 siblings, 1 reply; 23+ messages in thread
From: Eryu Guan @ 2017-02-17  6:53 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, Dave Chinner

On Wed, Feb 15, 2017 at 10:40:42AM -0500, Brian Foster wrote:
> Hi all,
> 
> This is a collection of several quota related deadlock fixes for
> problems that have been reported to the list recently.
> 
> Patch 1 fixes the low memory quotacheck problem reported by Martin[1].
> Dave is CC'd as he had comments on this particular thread that started a
> discussion, but I hadn't heard anything back since my last response.
> 
> Patch 2 fixes a separate problem I ran into while attempting to
> reproduce Eryu's xfs/305 hang report[2]. 
> 
> Patches 3-5 fix the actual problem reported by Eryu, which is a quotaoff
> deadlock reproduced by xfs/305.
> 
> Further details are included in the individual commit log descriptions.
> Thoughts, reviews, flames appreciated.
> 
> Eryu,
> 
> I've run several hundred iterations of this on your reproducer system
> without reproducing the hang. I have reproduced a reset overnight but
> still haven't been able to grab a stack trace from that occurrence (I'll
> try again today/tonight with better console logging). I suspect this is

I hit a NULL pointer dereference while testing your fix: I was running
xfs/305 for 1000 iterations and the host crashed at the 639th run. I'm
not sure if it's the same issue you've hit here. I've posted the dmesg
log at the end of this mail. I haven't tried to see if I can reproduce
it on a stock Linus tree yet.

On another host, xfs/305 has run for 500 iterations so far without
problems; I'll keep it running for a while longer.

Thanks,
Eryu

[57779.280327] run fstests xfs/305 at 2017-02-17 14:41:53
[57779.715697] XFS (dm-5): Unmounting Filesystem
[57783.699225] XFS (dm-5): EXPERIMENTAL reverse mapping btree feature enabled. Use at your own risk!
[57783.746222] XFS (dm-5): EXPERIMENTAL reflink feature enabled. Use at your own risk!
[57783.781671] XFS (dm-5): Mounting V5 Filesystem
[57784.004821] XFS (dm-5): Ending clean mount
[57784.040650] XFS (dm-5): Unmounting Filesystem
[57787.791041] XFS (dm-5): EXPERIMENTAL reverse mapping btree feature enabled. Use at your own risk!
[57787.837644] XFS (dm-5): EXPERIMENTAL reflink feature enabled. Use at your own risk!
[57787.872553] XFS (dm-5): Mounting V5 Filesystem
[57787.989184] XFS (dm-5): Ending clean mount
[57788.007960] XFS (dm-5): Quotacheck needed: Please wait.
[57788.142359] XFS (dm-5): Quotacheck: Done.
[57788.294295] XFS (dm-5): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
[57808.117713] XFS (dm-5): Unmounting Filesystem
[57808.708484] XFS (dm-5): EXPERIMENTAL reverse mapping btree feature enabled. Use at your own risk!
[57808.754928] XFS (dm-5): EXPERIMENTAL reflink feature enabled. Use at your own risk!
[57808.808546] XFS (dm-5): Mounting V5 Filesystem
[57809.092982] XFS (dm-5): Ending clean mount
[57809.113320] XFS (dm-5): Quotacheck needed: Please wait.
[57810.033450] XFS (dm-5): Quotacheck: Done.
[57811.979626] XFS (dm-5): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
[57821.196437] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[57821.235127] IP: xlog_write+0x243/0x7b0 [xfs]
[57821.256325] PGD 0
[57821.256325]
[57821.273804] Oops: 0000 [#1] SMP
[57821.289563] Modules linked in: binfmt_misc xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel iTCO_wdt crypto_simd iTCO_vendor_support glue_helper ipmi_ssif cryptd hpilo hpwdt pcspkr ipmi_si i2c_i801 lpc_ich
[57821.622303]  ioatdma nfsd sg ipmi_devintf shpchp dca pcc_cpufreq ipmi_msghandler wmi acpi_power_meter acpi_cpufreq auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm tg3 uas drm ptp usb_storage serio_raw hpsa crc32c_intel i2c_core pps_core fjes scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
[57821.794306] CPU: 3 PID: 29556 Comm: kworker/3:5 Tainted: G        W       4.10.0-rc4.xfs305+ #22
[57821.836074] Hardware name: HP ProLiant DL360 Gen9, BIOS P89 05/06/2015
[57821.865964] Workqueue: xfs-cil/dm-5 xlog_cil_push_work [xfs]
[57821.891286] task: ffff880804462d00 task.stack: ffffc900072e8000
[57821.917941] RIP: 0010:xlog_write+0x243/0x7b0 [xfs]
[57821.939935] RSP: 0018:ffffc900072ebcc8 EFLAGS: 00010246
[57821.964071] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[57821.996048] RDX: 0000000000000006 RSI: 0000000000000000 RDI: 0000000000000000
[57822.028123] RBP: ffffc900072ebd68 R08: 0000000000000600 R09: 0000000000040000
[57822.060083] R10: 0000000000000000 R11: 0000000000000006 R12: 0000000000000000
[57822.092224] R13: ffff880804f302e0 R14: ffff880853288000 R15: ffffc90024a01600
[57822.124209] FS:  0000000000000000(0000) GS:ffff88085fcc0000(0000) knlGS:0000000000000000
[57822.160446] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[57822.186144] CR2: 0000000000000008 CR3: 0000000001c09000 CR4: 00000000001406e0
[57822.218143] Call Trace:
[57822.229306]  xlog_cil_push+0x2a6/0x470 [xfs]
[57822.250663]  xlog_cil_push_work+0x15/0x20 [xfs]
[57822.274715]  process_one_work+0x165/0x410
[57822.293371]  worker_thread+0x27f/0x4c0
[57822.310145]  kthread+0x101/0x140
[57822.324549]  ? rescuer_thread+0x3b0/0x3b0
[57822.342527]  ? kthread_park+0x90/0x90
[57822.358856]  ? do_syscall_64+0x165/0x180
[57822.376436]  ret_from_fork+0x2c/0x40
[57822.392427] Code: c8 04 88 5d 83 88 45 82 41 8b 46 08 85 c0 0f 85 f2 02 00 00 41 83 7e 2c ff 0f 84 4c 05 00 00 4c 63 65 bc 49 c1 e4 04 4c 03 65 a0 <41> f6 44 24 08 03 74 18 ba 3e 09 00 00 48 c7 c6 f6 8a 34 a0 48
[57822.477016] RIP: xlog_write+0x243/0x7b0 [xfs] RSP: ffffc900072ebcc8
[57822.505055] CR2: 0000000000000008
[57822.522334] ---[ end trace 041d7b1a49184126 ]---
[57822.548331] Kernel panic - not syncing: Fatal exception
[57822.571795] Kernel Offset: disabled
[57822.593828] ---[ end Kernel panic - not syncing: Fatal exception
[57822.621048] ------------[ cut here ]------------
[57822.641914] WARNING: CPU: 3 PID: 29556 at arch/x86/kernel/smp.c:127 native_smp_send_reschedule+0x3f/0x50
[57822.684393] Modules linked in: binfmt_misc xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel iTCO_wdt crypto_simd iTCO_vendor_support glue_helper ipmi_ssif cryptd hpilo hpwdt pcspkr ipmi_si i2c_i801 lpc_ich
[57823.009497]  ioatdma nfsd sg ipmi_devintf shpchp dca pcc_cpufreq ipmi_msghandler wmi acpi_power_meter acpi_cpufreq auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm tg3 uas drm ptp usb_storage serio_raw hpsa crc32c_intel i2c_core pps_core fjes scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
[57823.170780] CPU: 3 PID: 29556 Comm: kworker/3:5 Tainted: G      D W       4.10.0-rc4.xfs305+ #22
[57823.210153] Hardware name: HP ProLiant DL360 Gen9, BIOS P89 05/06/2015
[57823.239623] Workqueue: xfs-cil/dm-5 xlog_cil_push_work [xfs]
[57823.264951] Call Trace:
[57823.277094]  <IRQ>
[57823.287711]  dump_stack+0x63/0x87
[57823.305098]  __warn+0xd1/0xf0
[57823.318359]  warn_slowpath_null+0x1d/0x20
[57823.336295]  native_smp_send_reschedule+0x3f/0x50
[57823.357374]  trigger_load_balance+0x10f/0x1f0
[57823.376913]  scheduler_tick+0xa3/0xe0
[57823.393249]  ? tick_sched_do_timer+0x70/0x70
[57823.412373]  update_process_times+0x47/0x60
[57823.431660]  tick_sched_handle.isra.18+0x25/0x60
[57823.453341]  tick_sched_timer+0x40/0x70
[57823.470881]  __hrtimer_run_queues+0xf3/0x280
[57823.490170]  hrtimer_interrupt+0xa8/0x1a0
[57823.509122]  local_apic_timer_interrupt+0x35/0x60
[57823.531331]  smp_apic_timer_interrupt+0x38/0x50
[57823.552752]  apic_timer_interrupt+0x93/0xa0
[57823.571887] RIP: 0010:panic+0x1f8/0x239
[57823.589092] RSP: 0018:ffffc900072eba10 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[57823.623298] RAX: 0000000000000034 RBX: 0000000000000000 RCX: 0000000000000006
[57823.655358] RDX: 0000000000000000 RSI: 0000000000000046 RDI: ffff88085fccdfe0
[57823.687307] RBP: ffffc900072eba80 R08: 00000000fffffffe R09: 000000000000915a
[57823.719315] R10: 0000000000000005 R11: 0000000000009159 R12: ffffffff81a2f668
[57823.751391] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000046
[57823.783944]  </IRQ>
[57823.795292]  oops_end+0xb8/0xd0
[57823.812182]  no_context+0x19e/0x3f0
[57823.827792]  ? select_idle_sibling+0x2c/0x3d0
[57823.847287]  __bad_area_nosemaphore+0xee/0x1d0
[57823.867166]  ? __enqueue_entity+0x6c/0x70
[57823.885089]  bad_area_nosemaphore+0x14/0x20
[57823.903814]  __do_page_fault+0x89/0x4a0
[57823.921531]  ? check_preempt_wakeup+0x106/0x230
[57823.941952]  do_page_fault+0x30/0x80
[57823.958382]  page_fault+0x28/0x30
[57823.973847] RIP: 0010:xlog_write+0x243/0x7b0 [xfs]
[57823.996259] RSP: 0018:ffffc900072ebcc8 EFLAGS: 00010246
[57824.019806] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[57824.051819] RDX: 0000000000000006 RSI: 0000000000000000 RDI: 0000000000000000
[57824.083915] RBP: ffffc900072ebd68 R08: 0000000000000600 R09: 0000000000040000
[57824.116122] R10: 0000000000000000 R11: 0000000000000006 R12: 0000000000000000
[57824.148050] R13: ffff880804f302e0 R14: ffff880853288000 R15: ffffc90024a01600
[57824.179966]  ? xlog_write+0x762/0x7b0 [xfs]
[57824.198643]  xlog_cil_push+0x2a6/0x470 [xfs]
[57824.217802]  xlog_cil_push_work+0x15/0x20 [xfs]
[57824.238297]  process_one_work+0x165/0x410
[57824.256209]  worker_thread+0x27f/0x4c0
[57824.272976]  kthread+0x101/0x140
[57824.288084]  ? rescuer_thread+0x3b0/0x3b0
[57824.308736]  ? kthread_park+0x90/0x90
[57824.327867]  ? do_syscall_64+0x165/0x180
[57824.345398]  ret_from_fork+0x2c/0x40
[57824.361034] ---[ end trace 041d7b1a49184127 ]---
[57824.381686] ------------[ cut here ]------------
[57824.402342] WARNING: CPU: 3 PID: 29556 at arch/x86/kernel/smp.c:127 native_smp_send_reschedule+0x3f/0x50
[57824.445004] Modules linked in: binfmt_misc xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel iTCO_wdt crypto_simd iTCO_vendor_support glue_helper ipmi_ssif cryptd hpilo hpwdt pcspkr ipmi_si i2c_i801 lpc_ich
[57824.763533]  ioatdma nfsd sg ipmi_devintf shpchp dca pcc_cpufreq ipmi_msghandler wmi acpi_power_meter acpi_cpufreq auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm tg3 uas drm ptp usb_storage serio_raw hpsa crc32c_intel i2c_core pps_core fjes scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
[57824.931113] CPU: 3 PID: 29556 Comm: kworker/3:5 Tainted: G      D W       4.10.0-rc4.xfs305+ #22
[57824.970667] Hardware name: HP ProLiant DL360 Gen9, BIOS P89 05/06/2015
[57824.999927] Workqueue: xfs-cil/dm-5 xlog_cil_push_work [xfs]
[57825.026452] Call Trace:
[57825.037533]  <IRQ>
[57825.046518]  dump_stack+0x63/0x87
[57825.061285]  __warn+0xd1/0xf0
[57825.074564]  warn_slowpath_null+0x1d/0x20
[57825.092513]  native_smp_send_reschedule+0x3f/0x50
[57825.113571]  resched_curr+0xa1/0xc0
[57825.129181]  check_preempt_curr+0x70/0x90
[57825.146768]  ttwu_do_wakeup+0x19/0xe0
[57825.163141]  ttwu_do_activate+0x6f/0x80
[57825.180275]  try_to_wake_up+0x1aa/0x3b0
[57825.197440]  default_wake_function+0x12/0x20
[57825.216596]  pollwake+0x73/0x90
[57825.230654]  ? wake_up_q+0x80/0x80
[57825.246033]  __wake_up_common+0x55/0x90
[57825.263171]  __wake_up+0x39/0x50
[57825.277624]  credit_entropy_bits+0x1fe/0x2a0
[57825.296829]  ? add_interrupt_randomness+0x1b9/0x210
[57825.320068]  add_interrupt_randomness+0x1b9/0x210
[57825.345255]  handle_irq_event_percpu+0x40/0x80
[57825.365728]  handle_irq_event+0x3b/0x60
[57825.382876]  handle_edge_irq+0x8d/0x130
[57825.400120]  handle_irq+0xab/0x130
[57825.415340]  do_IRQ+0x48/0xd0
[57825.428621]  common_interrupt+0x93/0x93
[57825.446017] RIP: 0010:__do_softirq+0x6d/0x28c
[57825.465570] RSP: 0018:ffff88085fcc3f68 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff1b
[57825.499744] RAX: ffff880804462d00 RBX: 0000000000000000 RCX: 0000000000000282
[57825.533033] RDX: 00000000000193fa RSI: 00000000f2fa3225 RDI: 00000000000006e0
[57825.565178] RBP: ffff88085fcc3fb8 R08: 00003496f5963280 R09: 0000000000000000
[57825.597220] R10: 0000000000000003 R11: 0000000000000020 R12: ffffffff81a2f668
[57825.629321] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000046
[57825.661321]  irq_exit+0xd9/0xf0
[57825.675454]  smp_apic_timer_interrupt+0x3d/0x50
[57825.695743]  apic_timer_interrupt+0x93/0xa0
[57825.714467] RIP: 0010:panic+0x1f8/0x239
[57825.731620] RSP: 0018:ffffc900072eba10 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[57825.765508] RAX: 0000000000000034 RBX: 0000000000000000 RCX: 0000000000000006
[57825.797718] RDX: 0000000000000000 RSI: 0000000000000046 RDI: ffff88085fccdfe0
[57825.831010] RBP: ffffc900072eba80 R08: 00000000fffffffe R09: 000000000000915a
[57825.866503] R10: 0000000000000005 R11: 0000000000009159 R12: ffffffff81a2f668
[57825.898470] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000046
[57825.930824]  </IRQ>
[57825.940296]  oops_end+0xb8/0xd0
[57825.954347]  no_context+0x19e/0x3f0
[57825.969927]  ? select_idle_sibling+0x2c/0x3d0
[57825.989418]  __bad_area_nosemaphore+0xee/0x1d0
[57826.009319]  ? __enqueue_entity+0x6c/0x70
[57826.027283]  bad_area_nosemaphore+0x14/0x20
[57826.046834]  __do_page_fault+0x89/0x4a0
[57826.064164]  ? check_preempt_wakeup+0x106/0x230
[57826.084437]  do_page_fault+0x30/0x80
[57826.100450]  page_fault+0x28/0x30
[57826.115307] RIP: 0010:xlog_write+0x243/0x7b0 [xfs]
[57826.136805] RSP: 0018:ffffc900072ebcc8 EFLAGS: 00010246
[57826.160181] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[57826.192114] RDX: 0000000000000006 RSI: 0000000000000000 RDI: 0000000000000000
[57826.224161] RBP: ffffc900072ebd68 R08: 0000000000000600 R09: 0000000000040000
[57826.256406] R10: 0000000000000000 R11: 0000000000000006 R12: 0000000000000000
[57826.288363] R13: ffff880804f302e0 R14: ffff880853288000 R15: ffffc90024a01600
[57826.320430]  ? xlog_write+0x762/0x7b0 [xfs]
[57826.341434]  xlog_cil_push+0x2a6/0x470 [xfs]
[57826.364361]  xlog_cil_push_work+0x15/0x20 [xfs]
[57826.384629]  process_one_work+0x165/0x410
[57826.402564]  worker_thread+0x27f/0x4c0
[57826.419452]  kthread+0x101/0x140
[57826.434012]  ? rescuer_thread+0x3b0/0x3b0
[57826.452045]  ? kthread_park+0x90/0x90
[57826.468470]  ? do_syscall_64+0x165/0x180
[57826.486147]  ret_from_fork+0x2c/0x40
[57826.502218] ---[ end trace 041d7b1a49184128 ]---

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/5] xfs: quota deadlock fixes
  2017-02-17  6:53 ` Eryu Guan
@ 2017-02-17 17:54   ` Brian Foster
  2017-02-20  3:52     ` Eryu Guan
  0 siblings, 1 reply; 23+ messages in thread
From: Brian Foster @ 2017-02-17 17:54 UTC (permalink / raw)
  To: Eryu Guan; +Cc: linux-xfs, Dave Chinner

On Fri, Feb 17, 2017 at 02:53:15PM +0800, Eryu Guan wrote:
> On Wed, Feb 15, 2017 at 10:40:42AM -0500, Brian Foster wrote:
> > Hi all,
> > 
> > This is a collection of several quota related deadlock fixes for
> > problems that have been reported to the list recently.
> > 
> > Patch 1 fixes the low memory quotacheck problem reported by Martin[1].
> > Dave is CC'd as he had comments on this particular thread that started a
> > discussion, but I hadn't heard anything back since my last response.
> > 
> > Patch 2 fixes a separate problem I ran into while attempting to
> > reproduce Eryu's xfs/305 hang report[2]. 
> > 
> > Patches 3-5 fix the actual problem reported by Eryu, which is a quotaoff
> > deadlock reproduced by xfs/305.
> > 
> > Further details are included in the individual commit log descriptions.
> > Thoughts, reviews, flames appreciated.
> > 
> > Eryu,
> > 
> > I've run several hundred iterations of this on your reproducer system
> > without reproducing the hang. I have reproduced a reset overnight but
> > still haven't been able to grab a stack trace from that occurrence (I'll
> > try again today/tonight with better console logging). I suspect this is
> 
> I hit a NULL pointer dereference while testing your fix: I was running
> xfs/305 for 1000 iterations and the host crashed at the 639th run. I'm
> not sure if it's the same issue you've hit here. I've posted the dmesg
> log at the end of this mail. I haven't tried to see if I can reproduce
> it on a stock Linus tree yet.
> 

Interesting, thanks. I don't know for sure because I didn't hit anything
on my second overnight run, but I wouldn't be surprised if it's the
same, particularly if you hit this again. This does look like an
independent problem to me, though. A kdump might be nice, if possible,
given how difficult it is to reproduce...

Brian

> On another host, xfs/305 has run for 500 iterations so far without
> problems; I'll keep it running for a while longer.
> 
> Thanks,
> Eryu
> 
> [57779.280327] run fstests xfs/305 at 2017-02-17 14:41:53
> [57779.715697] XFS (dm-5): Unmounting Filesystem
> [57783.699225] XFS (dm-5): EXPERIMENTAL reverse mapping btree feature enabled. Use at your own risk!
> [57783.746222] XFS (dm-5): EXPERIMENTAL reflink feature enabled. Use at your own risk!
> [57783.781671] XFS (dm-5): Mounting V5 Filesystem
> [57784.004821] XFS (dm-5): Ending clean mount
> [57784.040650] XFS (dm-5): Unmounting Filesystem
> [57787.791041] XFS (dm-5): EXPERIMENTAL reverse mapping btree feature enabled. Use at your own risk!
> [57787.837644] XFS (dm-5): EXPERIMENTAL reflink feature enabled. Use at your own risk!
> [57787.872553] XFS (dm-5): Mounting V5 Filesystem
> [57787.989184] XFS (dm-5): Ending clean mount
> [57788.007960] XFS (dm-5): Quotacheck needed: Please wait.
> [57788.142359] XFS (dm-5): Quotacheck: Done.
> [57788.294295] XFS (dm-5): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
> [57808.117713] XFS (dm-5): Unmounting Filesystem
> [57808.708484] XFS (dm-5): EXPERIMENTAL reverse mapping btree feature enabled. Use at your own risk!
> [57808.754928] XFS (dm-5): EXPERIMENTAL reflink feature enabled. Use at your own risk!
> [57808.808546] XFS (dm-5): Mounting V5 Filesystem
> [57809.092982] XFS (dm-5): Ending clean mount
> [57809.113320] XFS (dm-5): Quotacheck needed: Please wait.
> [57810.033450] XFS (dm-5): Quotacheck: Done.
> [57811.979626] XFS (dm-5): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
> [57821.196437] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
> [57821.235127] IP: xlog_write+0x243/0x7b0 [xfs]
> [57821.256325] PGD 0
> [57821.256325]
> [57821.273804] Oops: 0000 [#1] SMP
> [57821.289563] Modules linked in: binfmt_misc xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel iTCO_wdt crypto_simd iTCO_vendor_support glue_helper ipmi_ssif cryptd hpilo hpwdt pcspkr ipmi_si i2c_i801 lpc_ich
> [57821.622303]  ioatdma nfsd sg ipmi_devintf shpchp dca pcc_cpufreq ipmi_msghandler wmi acpi_power_meter acpi_cpufreq auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm tg3 uas drm ptp usb_storage serio_raw hpsa crc32c_intel i2c_core pps_core fjes scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
> [57821.794306] CPU: 3 PID: 29556 Comm: kworker/3:5 Tainted: G        W       4.10.0-rc4.xfs305+ #22
> [57821.836074] Hardware name: HP ProLiant DL360 Gen9, BIOS P89 05/06/2015
> [57821.865964] Workqueue: xfs-cil/dm-5 xlog_cil_push_work [xfs]
> [57821.891286] task: ffff880804462d00 task.stack: ffffc900072e8000
> [57821.917941] RIP: 0010:xlog_write+0x243/0x7b0 [xfs]
> [57821.939935] RSP: 0018:ffffc900072ebcc8 EFLAGS: 00010246
> [57821.964071] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> [57821.996048] RDX: 0000000000000006 RSI: 0000000000000000 RDI: 0000000000000000
> [57822.028123] RBP: ffffc900072ebd68 R08: 0000000000000600 R09: 0000000000040000
> [57822.060083] R10: 0000000000000000 R11: 0000000000000006 R12: 0000000000000000
> [57822.092224] R13: ffff880804f302e0 R14: ffff880853288000 R15: ffffc90024a01600
> [57822.124209] FS:  0000000000000000(0000) GS:ffff88085fcc0000(0000) knlGS:0000000000000000
> [57822.160446] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [57822.186144] CR2: 0000000000000008 CR3: 0000000001c09000 CR4: 00000000001406e0
> [57822.218143] Call Trace:
> [57822.229306]  xlog_cil_push+0x2a6/0x470 [xfs]
> [57822.250663]  xlog_cil_push_work+0x15/0x20 [xfs]
> [57822.274715]  process_one_work+0x165/0x410
> [57822.293371]  worker_thread+0x27f/0x4c0
> [57822.310145]  kthread+0x101/0x140
> [57822.324549]  ? rescuer_thread+0x3b0/0x3b0
> [57822.342527]  ? kthread_park+0x90/0x90
> [57822.358856]  ? do_syscall_64+0x165/0x180
> [57822.376436]  ret_from_fork+0x2c/0x40
> [57822.392427] Code: c8 04 88 5d 83 88 45 82 41 8b 46 08 85 c0 0f 85 f2 02 00 00 41 83 7e 2c ff 0f 84 4c 05 00 00 4c 63 65 bc 49 c1 e4 04 4c 03 65 a0 <41> f6 44 24 08 03 74 18 ba 3e 09 00 00 48 c7 c6 f6 8a 34 a0 48
> [57822.477016] RIP: xlog_write+0x243/0x7b0 [xfs] RSP: ffffc900072ebcc8
> [57822.505055] CR2: 0000000000000008
> [57822.522334] ---[ end trace 041d7b1a49184126 ]---
> [57822.548331] Kernel panic - not syncing: Fatal exception
> [57822.571795] Kernel Offset: disabled
> [57822.593828] ---[ end Kernel panic - not syncing: Fatal exception
> [57822.621048] ------------[ cut here ]------------
> [57822.641914] WARNING: CPU: 3 PID: 29556 at arch/x86/kernel/smp.c:127 native_smp_send_reschedule+0x3f/0x50
> [57822.684393] Modules linked in: binfmt_misc xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel iTCO_wdt crypto_simd iTCO_vendor_support glue_helper ipmi_ssif cryptd hpilo hpwdt pcspkr ipmi_si i2c_i801 lpc_ich
> [57823.009497]  ioatdma nfsd sg ipmi_devintf shpchp dca pcc_cpufreq ipmi_msghandler wmi acpi_power_meter acpi_cpufreq auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm tg3 uas drm ptp usb_storage serio_raw hpsa crc32c_intel i2c_core pps_core fjes scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
> [57823.170780] CPU: 3 PID: 29556 Comm: kworker/3:5 Tainted: G      D W       4.10.0-rc4.xfs305+ #22
> [57823.210153] Hardware name: HP ProLiant DL360 Gen9, BIOS P89 05/06/2015
> [57823.239623] Workqueue: xfs-cil/dm-5 xlog_cil_push_work [xfs]
> [57823.264951] Call Trace:
> [57823.277094]  <IRQ>
> [57823.287711]  dump_stack+0x63/0x87
> [57823.305098]  __warn+0xd1/0xf0
> [57823.318359]  warn_slowpath_null+0x1d/0x20
> [57823.336295]  native_smp_send_reschedule+0x3f/0x50
> [57823.357374]  trigger_load_balance+0x10f/0x1f0
> [57823.376913]  scheduler_tick+0xa3/0xe0
> [57823.393249]  ? tick_sched_do_timer+0x70/0x70
> [57823.412373]  update_process_times+0x47/0x60
> [57823.431660]  tick_sched_handle.isra.18+0x25/0x60
> [57823.453341]  tick_sched_timer+0x40/0x70
> [57823.470881]  __hrtimer_run_queues+0xf3/0x280
> [57823.490170]  hrtimer_interrupt+0xa8/0x1a0
> [57823.509122]  local_apic_timer_interrupt+0x35/0x60
> [57823.531331]  smp_apic_timer_interrupt+0x38/0x50
> [57823.552752]  apic_timer_interrupt+0x93/0xa0
> [57823.571887] RIP: 0010:panic+0x1f8/0x239
> [57823.589092] RSP: 0018:ffffc900072eba10 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
> [57823.623298] RAX: 0000000000000034 RBX: 0000000000000000 RCX: 0000000000000006
> [57823.655358] RDX: 0000000000000000 RSI: 0000000000000046 RDI: ffff88085fccdfe0
> [57823.687307] RBP: ffffc900072eba80 R08: 00000000fffffffe R09: 000000000000915a
> [57823.719315] R10: 0000000000000005 R11: 0000000000009159 R12: ffffffff81a2f668
> [57823.751391] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000046
> [57823.783944]  </IRQ>
> [57823.795292]  oops_end+0xb8/0xd0
> [57823.812182]  no_context+0x19e/0x3f0
> [57823.827792]  ? select_idle_sibling+0x2c/0x3d0
> [57823.847287]  __bad_area_nosemaphore+0xee/0x1d0
> [57823.867166]  ? __enqueue_entity+0x6c/0x70
> [57823.885089]  bad_area_nosemaphore+0x14/0x20
> [57823.903814]  __do_page_fault+0x89/0x4a0
> [57823.921531]  ? check_preempt_wakeup+0x106/0x230
> [57823.941952]  do_page_fault+0x30/0x80
> [57823.958382]  page_fault+0x28/0x30
> [57823.973847] RIP: 0010:xlog_write+0x243/0x7b0 [xfs]
> [57823.996259] RSP: 0018:ffffc900072ebcc8 EFLAGS: 00010246
> [57824.019806] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> [57824.051819] RDX: 0000000000000006 RSI: 0000000000000000 RDI: 0000000000000000
> [57824.083915] RBP: ffffc900072ebd68 R08: 0000000000000600 R09: 0000000000040000
> [57824.116122] R10: 0000000000000000 R11: 0000000000000006 R12: 0000000000000000
> [57824.148050] R13: ffff880804f302e0 R14: ffff880853288000 R15: ffffc90024a01600
> [57824.179966]  ? xlog_write+0x762/0x7b0 [xfs]
> [57824.198643]  xlog_cil_push+0x2a6/0x470 [xfs]
> [57824.217802]  xlog_cil_push_work+0x15/0x20 [xfs]
> [57824.238297]  process_one_work+0x165/0x410
> [57824.256209]  worker_thread+0x27f/0x4c0
> [57824.272976]  kthread+0x101/0x140
> [57824.288084]  ? rescuer_thread+0x3b0/0x3b0
> [57824.308736]  ? kthread_park+0x90/0x90
> [57824.327867]  ? do_syscall_64+0x165/0x180
> [57824.345398]  ret_from_fork+0x2c/0x40
> [57824.361034] ---[ end trace 041d7b1a49184127 ]---
> [57824.381686] ------------[ cut here ]------------
> [57824.402342] WARNING: CPU: 3 PID: 29556 at arch/x86/kernel/smp.c:127 native_smp_send_reschedule+0x3f/0x50
> [57824.445004] Modules linked in: binfmt_misc xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel iTCO_wdt crypto_simd iTCO_vendor_support glue_helper ipmi_ssif cryptd hpilo hpwdt pcspkr ipmi_si i2c_i801 lpc_ich
> [57824.763533]  ioatdma nfsd sg ipmi_devintf shpchp dca pcc_cpufreq ipmi_msghandler wmi acpi_power_meter acpi_cpufreq auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm tg3 uas drm ptp usb_storage serio_raw hpsa crc32c_intel i2c_core pps_core fjes scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
> [57824.931113] CPU: 3 PID: 29556 Comm: kworker/3:5 Tainted: G      D W       4.10.0-rc4.xfs305+ #22
> [57824.970667] Hardware name: HP ProLiant DL360 Gen9, BIOS P89 05/06/2015
> [57824.999927] Workqueue: xfs-cil/dm-5 xlog_cil_push_work [xfs]
> [57825.026452] Call Trace:
> [57825.037533]  <IRQ>
> [57825.046518]  dump_stack+0x63/0x87
> [57825.061285]  __warn+0xd1/0xf0
> [57825.074564]  warn_slowpath_null+0x1d/0x20
> [57825.092513]  native_smp_send_reschedule+0x3f/0x50
> [57825.113571]  resched_curr+0xa1/0xc0
> [57825.129181]  check_preempt_curr+0x70/0x90
> [57825.146768]  ttwu_do_wakeup+0x19/0xe0
> [57825.163141]  ttwu_do_activate+0x6f/0x80
> [57825.180275]  try_to_wake_up+0x1aa/0x3b0
> [57825.197440]  default_wake_function+0x12/0x20
> [57825.216596]  pollwake+0x73/0x90
> [57825.230654]  ? wake_up_q+0x80/0x80
> [57825.246033]  __wake_up_common+0x55/0x90
> [57825.263171]  __wake_up+0x39/0x50
> [57825.277624]  credit_entropy_bits+0x1fe/0x2a0
> [57825.296829]  ? add_interrupt_randomness+0x1b9/0x210
> [57825.320068]  add_interrupt_randomness+0x1b9/0x210
> [57825.345255]  handle_irq_event_percpu+0x40/0x80
> [57825.365728]  handle_irq_event+0x3b/0x60
> [57825.382876]  handle_edge_irq+0x8d/0x130
> [57825.400120]  handle_irq+0xab/0x130
> [57825.415340]  do_IRQ+0x48/0xd0
> [57825.428621]  common_interrupt+0x93/0x93
> [57825.446017] RIP: 0010:__do_softirq+0x6d/0x28c
> [57825.465570] RSP: 0018:ffff88085fcc3f68 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff1b
> [57825.499744] RAX: ffff880804462d00 RBX: 0000000000000000 RCX: 0000000000000282
> [57825.533033] RDX: 00000000000193fa RSI: 00000000f2fa3225 RDI: 00000000000006e0
> [57825.565178] RBP: ffff88085fcc3fb8 R08: 00003496f5963280 R09: 0000000000000000
> [57825.597220] R10: 0000000000000003 R11: 0000000000000020 R12: ffffffff81a2f668
> [57825.629321] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000046
> [57825.661321]  irq_exit+0xd9/0xf0
> [57825.675454]  smp_apic_timer_interrupt+0x3d/0x50
> [57825.695743]  apic_timer_interrupt+0x93/0xa0
> [57825.714467] RIP: 0010:panic+0x1f8/0x239
> [57825.731620] RSP: 0018:ffffc900072eba10 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
> [57825.765508] RAX: 0000000000000034 RBX: 0000000000000000 RCX: 0000000000000006
> [57825.797718] RDX: 0000000000000000 RSI: 0000000000000046 RDI: ffff88085fccdfe0
> [57825.831010] RBP: ffffc900072eba80 R08: 00000000fffffffe R09: 000000000000915a
> [57825.866503] R10: 0000000000000005 R11: 0000000000009159 R12: ffffffff81a2f668
> [57825.898470] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000046
> [57825.930824]  </IRQ>
> [57825.940296]  oops_end+0xb8/0xd0
> [57825.954347]  no_context+0x19e/0x3f0
> [57825.969927]  ? select_idle_sibling+0x2c/0x3d0
> [57825.989418]  __bad_area_nosemaphore+0xee/0x1d0
> [57826.009319]  ? __enqueue_entity+0x6c/0x70
> [57826.027283]  bad_area_nosemaphore+0x14/0x20
> [57826.046834]  __do_page_fault+0x89/0x4a0
> [57826.064164]  ? check_preempt_wakeup+0x106/0x230
> [57826.084437]  do_page_fault+0x30/0x80
> [57826.100450]  page_fault+0x28/0x30
> [57826.115307] RIP: 0010:xlog_write+0x243/0x7b0 [xfs]
> [57826.136805] RSP: 0018:ffffc900072ebcc8 EFLAGS: 00010246
> [57826.160181] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> [57826.192114] RDX: 0000000000000006 RSI: 0000000000000000 RDI: 0000000000000000
> [57826.224161] RBP: ffffc900072ebd68 R08: 0000000000000600 R09: 0000000000040000
> [57826.256406] R10: 0000000000000000 R11: 0000000000000006 R12: 0000000000000000
> [57826.288363] R13: ffff880804f302e0 R14: ffff880853288000 R15: ffffc90024a01600
> [57826.320430]  ? xlog_write+0x762/0x7b0 [xfs]
> [57826.341434]  xlog_cil_push+0x2a6/0x470 [xfs]
> [57826.364361]  xlog_cil_push_work+0x15/0x20 [xfs]
> [57826.384629]  process_one_work+0x165/0x410
> [57826.402564]  worker_thread+0x27f/0x4c0
> [57826.419452]  kthread+0x101/0x140
> [57826.434012]  ? rescuer_thread+0x3b0/0x3b0
> [57826.452045]  ? kthread_park+0x90/0x90
> [57826.468470]  ? do_syscall_64+0x165/0x180
> [57826.486147]  ret_from_fork+0x2c/0x40
> [57826.502218] ---[ end trace 041d7b1a49184128 ]---

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] xfs: bypass dquot reclaim to avoid quotacheck deadlock
  2017-02-16 22:37   ` Dave Chinner
@ 2017-02-17 18:30     ` Brian Foster
  2017-02-17 23:12       ` Dave Chinner
  0 siblings, 1 reply; 23+ messages in thread
From: Brian Foster @ 2017-02-17 18:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Eryu Guan

On Fri, Feb 17, 2017 at 09:37:05AM +1100, Dave Chinner wrote:
> On Wed, Feb 15, 2017 at 10:40:43AM -0500, Brian Foster wrote:
> > Reclaim during quotacheck can lead to deadlocks on the dquot flush lock:
> > 
> >  - Quotacheck populates a local delwri queue with the physical dquot
> >    buffers.
> >  - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and dirties
> >    all of the dquots.
> >  - Reclaim kicks in and attempts to flush a dquot whose buffer is
> >    already queud on the quotacheck queue. The flush succeeds but
> >    queueing to the reclaim delwri queue fails as the backing buffer is
> >    already queued. The flush unlock is now deferred to I/O completion of
> >    the buffer from the quotacheck queue.
> >  - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires the
> >    flush lock to update the backing buffers with the in-core
> >    recalculated values. This deadlocks as the flush lock was acquired by
> >    reclaim but the buffer never submitted for I/O as it already resided
> >    on the quotacheck queue.
> > 
> > This is reproduced by running quotacheck on a filesystem with a couple
> > million inodes in low memory (512MB-1GB) situations.
> > 
> > Quotacheck first resets and collects the physical dquot buffers in a
> > delwri queue. Then, it traverses the filesystem inodes via bulkstat,
> > updates the in-core dquots, flushes the corrected dquots to the backing
> > buffers and finally submits the delwri queue for I/O. Since the backing
> > buffers are queued across the entire quotacheck operation, dquot reclaim
> > cannot possibly complete a dquot flush before quotacheck completes.
> > Therefore, dquot reclaim provides no real value during quotacheck.
> 
> Which is an OOM vector on systems with lots of dquots and low memory
> at mount time. Turning off dquot reclaim doesn't change that.
> 

It's not intended to. The inefficient design of quotacheck wrt memory
usage is a separate problem. AFAICT, that problem goes back as far as my
git tree does (2.6.xx).

Looking back through the git history, this deadlock appears to be a
regression as of commit 43ff2122e6 ("xfs: on-stack delayed write buffer
lists") in 2012, which replaced a trylock/push sequence from the
quotacheck flush walk with a synchronous flush lock.

> Address the root cause - the buffer list is never flushed and so
> pins the memory quotacheck requires, so we need to flush the buffer
> list more often.  We also need to submit the buffer list before the
> flush walks begin, thereby unlocking all the dquots before doing the
> flush walk and avoiding the deadlock.
> 

I acknowledge that this patch doesn't address the root problem, but
neither does this proposal. The root cause is not that quotacheck uses
too much memory, it's that the serialization between quotacheck and
dquot reclaim is bogus.

For example, if we drop the buffer list being held across the adjust and
flush walks, we do indeed unpin the buffers and dquots for memory
reclaim, particularly during the dqadjust bulkstat. The flush walk then
still has to cycle the flush lock of every dquot to guarantee we wait on
dquots being flushed by reclaim (otherwise we risk corruption with the
xfs_qm_flush_one() logic as it is today). The same deadlock vector still
exists, albeit with a smaller window, if reclaim happens to flush a
dquot that is backed by a buffer that has already been queued by the
flush walk. AFAICT, the only requirements for this to occur are a couple
of dquots on a single backing buffer and some memory pressure from
somewhere, so the local memory usage is irrelevant.

I'm specifically trying to address the deadlock problem here. If we
don't want to bypass reclaim, the other options I see are to rework the
logic on the reclaim side to check the buffer state before we flush the
dquot, or push on the buffer from the quotacheck side like we used to
before the commit above. I think the latter may be more robust as it
isolates the logic to quotacheck. The risk there is that we're open to
pushing buffers off the reclaim list, but that might be fine so long as
we have the buffer lock.
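
For illustration, here's a rough sketch of what resurrecting the
trylock/push pattern in xfs_qm_flush_one() could look like. This is
only a sketch to make the idea concrete, not the actual patch; the
xfs_dqflush_pushbuf_wait() helper is hypothetical and stands in for
whatever mechanism ends up writing the backing buffer out of our own
delwri queue and returning with the flush lock held:

STATIC int
xfs_qm_flush_one(
        struct xfs_dquot        *dqp,
        void                    *data)
{
        struct list_head        *buffer_list = data;
        struct xfs_buf          *bp = NULL;
        int                     error = 0;

        xfs_dqlock(dqp);
        if (dqp->dq_flags & XFS_DQ_FREEING)
                goto out_unlock;
        if (!XFS_DQ_IS_DIRTY(dqp))
                goto out_unlock;

        if (!xfs_dqflock_nowait(dqp)) {
                /*
                 * Reclaim beat us to the flush lock. The backing buffer
                 * already sits on our delwri queue, so the flush lock
                 * won't be released until that buffer is written. Push
                 * the buffer out directly and wait for the flush lock
                 * rather than deadlocking on it.
                 */
                xfs_dqflush_pushbuf_wait(dqp);  /* hypothetical helper */
        }

        /* reclaim may have flushed the dquot while we waited */
        if (!XFS_DQ_IS_DIRTY(dqp)) {
                xfs_dqfunlock(dqp);
                goto out_unlock;
        }

        error = xfs_qm_dqflush(dqp, &bp);
        if (error)
                goto out_unlock;

        xfs_buf_delwri_queue(bp, buffer_list);
        xfs_buf_relse(bp);
out_unlock:
        xfs_dqunlock(dqp);
        return error;
}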

I'm open to other ideas of course, but so far "flushing the buffer list
more often" isn't clear enough to suggest it actually solves the locking
problem... I'll put a quotacheck rework on my todo list as well, but
that's something I'll have to get to separately (but I'm happy to help
if anybody else wants to jump on that sooner.. ;).

Brian

> -Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] xfs: bypass dquot reclaim to avoid quotacheck deadlock
  2017-02-17 18:30     ` Brian Foster
@ 2017-02-17 23:12       ` Dave Chinner
  2017-02-18 12:55         ` Brian Foster
  0 siblings, 1 reply; 23+ messages in thread
From: Dave Chinner @ 2017-02-17 23:12 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, Eryu Guan

On Fri, Feb 17, 2017 at 01:30:09PM -0500, Brian Foster wrote:
> On Fri, Feb 17, 2017 at 09:37:05AM +1100, Dave Chinner wrote:
> > On Wed, Feb 15, 2017 at 10:40:43AM -0500, Brian Foster wrote:
> > > Reclaim during quotacheck can lead to deadlocks on the dquot flush lock:
> > > 
> > >  - Quotacheck populates a local delwri queue with the physical dquot
> > >    buffers.
> > >  - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and dirties
> > >    all of the dquots.
> > >  - Reclaim kicks in and attempts to flush a dquot whose buffer is
> > >    already queud on the quotacheck queue. The flush succeeds but
> > >    queueing to the reclaim delwri queue fails as the backing buffer is
> > >    already queued. The flush unlock is now deferred to I/O completion of
> > >    the buffer from the quotacheck queue.
> > >  - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires the
> > >    flush lock to update the backing buffers with the in-core
> > >    recalculated values. This deadlocks as the flush lock was acquired by
> > >    reclaim but the buffer never submitted for I/O as it already resided
> > >    on the quotacheck queue.
> > > 
> > > This is reproduced by running quotacheck on a filesystem with a couple
> > > million inodes in low memory (512MB-1GB) situations.
> > > 
> > > Quotacheck first resets and collects the physical dquot buffers in a
> > > delwri queue. Then, it traverses the filesystem inodes via bulkstat,
> > > updates the in-core dquots, flushes the corrected dquots to the backing
> > > buffers and finally submits the delwri queue for I/O. Since the backing
> > > buffers are queued across the entire quotacheck operation, dquot reclaim
> > > cannot possibly complete a dquot flush before quotacheck completes.
> > > Therefore, dquot reclaim provides no real value during quotacheck.
> > 
> > Which is an OOM vector on systems with lots of dquots and low memory
> > at mount time. Turning off dquot reclaim doesn't change that.
> > 
> 
> It's not intended to. The inefficient design of quotacheck wrt to memory
> usage is a separate problem. AFAICT, that problem goes back as far as my
> git tree does (2.6.xx).
> 
> Looking back through the git history, this deadlock appears to be a
> regression as of commit 43ff2122e6 ("xfs: on-stack delayed write buffer
> lists") in 2012, which replaced a trylock/push sequence from the
> quotacheck flush walk with a synchronous flush lock.

Right - that's the commit that changed the code to what it is now.
That's what I implied needs fixing....

> > Address the root cause - the buffer list is never flushed and so
> > pins the memory quotacheck requires, so we need to flush the buffer
> > list more often.  We also need to submit the buffer list before the
> > flush walks begin, thereby unlocking all the dquots before doing the
> > flush walk and avoiding the deadlock.
> > 
> 
> I acknowledge that this patch doesn't address the root problem, but
> neither does this proposal. The root cause is not that quotacheck uses
> too much memory, it's that the serialization between quotacheck and
> dquot reclaim is bogus.
> 
> For example, if we drop the buffer list being held across the adjust and
> flush walks, we do indeed unpin the buffers and dquots for memory
> reclaim, particularly during the dqadjust bulkstat. The flush walk then
> still has to cycle the flush lock of every dquot to guarantee we wait on
> dquots being flushed by reclaim (otherwise we risk corruption with the
> xfs_qm_flush_one() logic is it is today). The same deadlock vector still
> exists, albeit with a smaller window, if reclaim happens to flush a
> dquot that is backed by a buffer that has already been queued by the
> flush walk. AFAICT, the only requirements for this to occur are a couple
> of dquots on a single backing buffer and some memory pressure from
> somewhere, so the local memory usage is irrelevant.

Except that the window is *tiny* because the dquots are walked in
order and so all the dquots are processed in ascending order. This
means all the dquots in a buffer are flushed sequentially, and each
dquot lookup resets the reclaim reference count on the dquot
buffer. Hence it is extremely unlikely that a dquot buffer will be
reclaimed while it is partially flushed during the flush walk.

> I'm specifically trying to address the deadlock problem here. If we
> don't want to bypass reclaim, the other options I see are to rework the
> logic on the reclaim side to check the buffer state before we flush the
> dquot, or push on the buffer from the quotacheck side like we used to
> before the commit above.

The latter is effectively what flushing the buffer list before the
flush walk is run does.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] xfs: bypass dquot reclaim to avoid quotacheck deadlock
  2017-02-17 23:12       ` Dave Chinner
@ 2017-02-18 12:55         ` Brian Foster
  0 siblings, 0 replies; 23+ messages in thread
From: Brian Foster @ 2017-02-18 12:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Eryu Guan

On Sat, Feb 18, 2017 at 10:12:15AM +1100, Dave Chinner wrote:
> On Fri, Feb 17, 2017 at 01:30:09PM -0500, Brian Foster wrote:
> > On Fri, Feb 17, 2017 at 09:37:05AM +1100, Dave Chinner wrote:
> > > On Wed, Feb 15, 2017 at 10:40:43AM -0500, Brian Foster wrote:
> > > > Reclaim during quotacheck can lead to deadlocks on the dquot flush lock:
> > > > 
> > > >  - Quotacheck populates a local delwri queue with the physical dquot
> > > >    buffers.
> > > >  - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and dirties
> > > >    all of the dquots.
> > > >  - Reclaim kicks in and attempts to flush a dquot whose buffer is
> > > >    already queud on the quotacheck queue. The flush succeeds but
> > > >    queueing to the reclaim delwri queue fails as the backing buffer is
> > > >    already queued. The flush unlock is now deferred to I/O completion of
> > > >    the buffer from the quotacheck queue.
> > > >  - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires the
> > > >    flush lock to update the backing buffers with the in-core
> > > >    recalculated values. This deadlocks as the flush lock was acquired by
> > > >    reclaim but the buffer never submitted for I/O as it already resided
> > > >    on the quotacheck queue.
> > > > 
> > > > This is reproduced by running quotacheck on a filesystem with a couple
> > > > million inodes in low memory (512MB-1GB) situations.
> > > > 
> > > > Quotacheck first resets and collects the physical dquot buffers in a
> > > > delwri queue. Then, it traverses the filesystem inodes via bulkstat,
> > > > updates the in-core dquots, flushes the corrected dquots to the backing
> > > > buffers and finally submits the delwri queue for I/O. Since the backing
> > > > buffers are queued across the entire quotacheck operation, dquot reclaim
> > > > cannot possibly complete a dquot flush before quotacheck completes.
> > > > Therefore, dquot reclaim provides no real value during quotacheck.
> > > 
> > > Which is an OOM vector on systems with lots of dquots and low memory
> > > at mount time. Turning off dquot reclaim doesn't change that.
> > > 
> > 
> > It's not intended to. The inefficient design of quotacheck wrt to memory
> > usage is a separate problem. AFAICT, that problem goes back as far as my
> > git tree does (2.6.xx).
> > 
> > Looking back through the git history, this deadlock appears to be a
> > regression as of commit 43ff2122e6 ("xfs: on-stack delayed write buffer
> > lists") in 2012, which replaced a trylock/push sequence from the
> > quotacheck flush walk with a synchronous flush lock.
> 
> Right - that's the commit that changed the code to what it is now.
> That's what I implied needs fixing....
> 

Ok, but that commit has nothing to do with the design of quotacheck.

> > > Address the root cause - the buffer list is never flushed and so
> > > pins the memory quotacheck requires, so we need to flush the buffer
> > > list more often.  We also need to submit the buffer list before the
> > > flush walks begin, thereby unlocking all the dquots before doing the
> > > flush walk and avoiding the deadlock.
> > > 
> > 
> > I acknowledge that this patch doesn't address the root problem, but
> > neither does this proposal. The root cause is not that quotacheck uses
> > too much memory, it's that the serialization between quotacheck and
> > dquot reclaim is bogus.
> > 
> > For example, if we drop the buffer list being held across the adjust and
> > flush walks, we do indeed unpin the buffers and dquots for memory
> > reclaim, particularly during the dqadjust bulkstat. The flush walk then
> > still has to cycle the flush lock of every dquot to guarantee we wait on
> > dquots being flushed by reclaim (otherwise we risk corruption with the
> > xfs_qm_flush_one() logic is it is today). The same deadlock vector still
> > exists, albeit with a smaller window, if reclaim happens to flush a
> > dquot that is backed by a buffer that has already been queued by the
> > flush walk. AFAICT, the only requirements for this to occur are a couple
> > of dquots on a single backing buffer and some memory pressure from
> > somewhere, so the local memory usage is irrelevant.
> 
> Except that the window is *tiny* because the dquots are walked in
> order and so all the dquots are processed in ascending order. This
> means all the dquots in a buffer are flushed sequentially, and each
> dquot lookup resets the reclaim reference count count on the dquot
> buffer. Hence it is extremely unlikely that a dquot buffer will be
> reclaimed while it is partially flushed during the flush walk.
> 

At a high level, this fundamentally makes dquot reclaim safe during the
dqadjust bulkstat. We are susceptible to the same exact problem during
the flush walk... it is true that the race window is now reduced to the
time spent processing the dquots of a particular buffer. The logic
used to explain away the race here doesn't make a lot of sense,
however...

First, the reference count on the underlying buffer is irrelevant
because dquot reclaim is what causes the race. All dquot reference
counts are zero when the flush walk begins. In other words, all in-core
dquots are on the lru and thus subject to reclaim at this point in time.

Second, the window is definitely not tiny. According to xfs_qm.h, we
have 30 dquots per 4k block. The only way a flush via dquot reclaim does
_not_ deadlock is if reclaim is the first to flush a dquot of a particular
buffer (such that it can queue the buffer). In most reclaim scans this
may work fine, but if dquots are flushed in ascending order, that means
if reclaim ever has to flush a dquot, it has pretty much caught up to the
flush walk or will intersect with it eventually.

Further, the flush walk code iterates 32 dquots at a time. That's
effectively a qi_tree_lock cycle and radix tree lookup per dquot buffer.
With thousands of dquots around in memory, I think that presents ample
opportunity for reclaim to kick in and flush a dquot that's not the
first to be flushed to its backing buffer. All it probably takes is for
the tree lock cycle and the radix tree lookup for the next batch of 32
dquots to occur before the end of a particular buffer to present an even
bigger window for reclaim to kick in and flush one of the remaining
dquots on that particular buffer. Again, that's going to occur for
almost every dquot buffer.
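
To make that concrete, the flush walk is driven by a batched radix
tree lookup of roughly the following shape (condensed from the current
xfs_qm_dquot_walk() from memory, so treat it as a schematic rather
than a verbatim copy):

#define XFS_DQ_LOOKUP_BATCH     32

        while (1) {
                struct xfs_dquot        *batch[XFS_DQ_LOOKUP_BATCH];
                int                     nr_found;
                int                     i;

                mutex_lock(&qi->qi_tree_lock);
                nr_found = radix_tree_gang_lookup(tree, (void **)batch,
                                        next_index, XFS_DQ_LOOKUP_BATCH);
                if (!nr_found) {
                        mutex_unlock(&qi->qi_tree_lock);
                        break;
                }

                for (i = 0; i < nr_found; i++) {
                        struct xfs_dquot *dqp = batch[i];

                        next_index = be32_to_cpu(dqp->q_core.d_id) + 1;

                        /* xfs_qm_flush_one() for the flush walk */
                        error = execute(batch[i], data);
                        if (error == -EAGAIN) {
                                skipped++;
                                continue;
                        }
                        if (error && last_error != -EFSCORRUPTED)
                                last_error = error;
                }

                mutex_unlock(&qi->qi_tree_lock);

                /* corruption and id-overflow checks elided */
        }

(The 30-per-block number above is just 4096 / sizeof(struct xfs_dqblk)
= 4096 / 136, assuming I have the v5 on-disk dquot size right.)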

This is all kind of pointless analysis anyways because if the buffer
list is submitted before the flush walk, the flush walk in its current
form is not sufficient to serialize quotacheck against I/O completion of
underlying dquot buffers. dquot reclaim can now flush and free any
number of dquots for which quotacheck won't have any window into the
state of the underlying buffer I/O. This probably means the
xfs_qm_dquot_walk() (radix tree) based flush walk needs to be replaced
with something that visits every dquot (or dquot buffer) in the fs,
reallocating them if necessary (which adds more flush walk -> reclaim
contention than we've considered above), in order to serialize against
the underlying buffer. Either way, to just submit the buffer list before
the flush walk and make no other changes to account for the lost buffer
serialization is to break quotacheck.

> > I'm specifically trying to address the deadlock problem here. If we
> > don't want to bypass reclaim, the other options I see are to rework the
> > logic on the reclaim side to check the buffer state before we flush the
> > dquot, or push on the buffer from the quotacheck side like we used to
> > before the commit above.
> 
> The latter is effectively what flushing the buffer list before the
> flush walk is run does.
> 

The buffer push forcibly writes out the buffer for any dquot that is
found flush locked during the quotacheck flush walk, then waits on the
flush lock again to serialize against the write; this allows us to flush
such dquots to the buffer without a deadlock. Flushing the buffer list
before the flush walk doesn't protect us from deadlock once the flush
walk actually begins, and creates new serialization problems with
broader quotacheck that can cause corruption if not handled properly.

I can try to put something together next week to resurrect the buffer
push, similar to what was implemented historically prior to commit 43ff2122e6,
if you're Ok with that approach for the locking problem (or have some
other suggestion that doesn't handwave away the serialization problem
;)... Otherwise, I guess we can leave things as they are. I don't want
to rely on things like indeterminate cache walk and reclaim ordering to
try and justify whether a clear serialization problem is going to
trigger or not in various high/low memory/dquot configurations. Even if
it doesn't trigger right now, if we change something innocuous down the
road like the dquot buffer size then all of this logic goes out the
window and we have a "new" deadlock to track down and resolve.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/5] xfs: quota deadlock fixes
  2017-02-17 17:54   ` Brian Foster
@ 2017-02-20  3:52     ` Eryu Guan
  2017-02-20 13:25       ` Brian Foster
  0 siblings, 1 reply; 23+ messages in thread
From: Eryu Guan @ 2017-02-20  3:52 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, Dave Chinner

[-- Attachment #1: Type: text/plain, Size: 2302 bytes --]

On Fri, Feb 17, 2017 at 12:54:54PM -0500, Brian Foster wrote:
> On Fri, Feb 17, 2017 at 02:53:15PM +0800, Eryu Guan wrote:
> > On Wed, Feb 15, 2017 at 10:40:42AM -0500, Brian Foster wrote:
> > > Hi all,
> > > 
> > > This is a collection of several quota related deadlock fixes for
> > > problems that have been reported to the list recently.
> > > 
> > > Patch 1 fixes the low memory quotacheck problem reported by Martin[1].
> > > Dave is CC'd as he had comments on this particular thread that started a
> > > discussion, but I hadn't heard anything back since my last response.
> > > 
> > > Patch 2 fixes a separate problem I ran into while attempting to
> > > reproduce Eryu's xfs/305 hang report[2]. 
> > > 
> > > Patches 3-5 fix the actual problem reported by Eryu, which is a quotaoff
> > > deadlock reproduced by xfs/305.
> > > 
> > > Further details are included in the individual commit log descriptions.
> > > Thoughts, reviews, flames appreciated.
> > > 
> > > Eryu,
> > > 
> > > I've run several hundred iterations of this on your reproducer system
> > > without reproducing the hang. I have reproduced a reset overnight but
> > > still haven't been able to grab a stack trace from that occurrence (I'll
> > > try again today/tonight with better console logging). I suspect this is
> > 
> > I hit a NULL pointer dereference while testing your fix, I was running
> > xfs/305 for 1000 iterations and host crashed at the 639th run. Not sure
> > if it's the same issue you've met here. I posted dmesg log at the end of
> > mail. I haven't tried to see if I can reproduce it on stock linus tree
> > yet.
> > 
> 
> Interesting, thanks. I don't know for sure because I didn't hit anything
> on my second overnight run, but I wouldn't be surprised if it's the
> same, particularly if you hit this again. This does look like an
> independent problem to me, though. A kdump might be nice, if possible,
> given the difficulty to reproduce...

Unfortunately, my second round of 1000 iterations hit the hang too, at
the 824th loop.  The test configuration is all default: crc-enabled XFS
with 4k block size, no rmapbt, no reflink, no finobt, no sparse inodes.

I attached the dmesg log and sysrq-w output. I also left the host in the
hung state, so you can log in and take a look if you're interested.

Thanks,
Eryu

[-- Attachment #2: upstream-4.10-rc4-xfs-305-testpatch-sysrq-w.log.bz2 --]
[-- Type: application/x-bzip2, Size: 22771 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/5] xfs: quota deadlock fixes
  2017-02-20  3:52     ` Eryu Guan
@ 2017-02-20 13:25       ` Brian Foster
  2017-02-22 15:35         ` Brian Foster
  0 siblings, 1 reply; 23+ messages in thread
From: Brian Foster @ 2017-02-20 13:25 UTC (permalink / raw)
  To: Eryu Guan; +Cc: linux-xfs, Dave Chinner

On Mon, Feb 20, 2017 at 11:52:56AM +0800, Eryu Guan wrote:
> On Fri, Feb 17, 2017 at 12:54:54PM -0500, Brian Foster wrote:
> > On Fri, Feb 17, 2017 at 02:53:15PM +0800, Eryu Guan wrote:
> > > On Wed, Feb 15, 2017 at 10:40:42AM -0500, Brian Foster wrote:
> > > > Hi all,
> > > > 
> > > > This is a collection of several quota related deadlock fixes for
> > > > problems that have been reported to the list recently.
> > > > 
> > > > Patch 1 fixes the low memory quotacheck problem reported by Martin[1].
> > > > Dave is CC'd as he had comments on this particular thread that started a
> > > > discussion, but I hadn't heard anything back since my last response.
> > > > 
> > > > Patch 2 fixes a separate problem I ran into while attempting to
> > > > reproduce Eryu's xfs/305 hang report[2]. 
> > > > 
> > > > Patches 3-5 fix the actual problem reported by Eryu, which is a quotaoff
> > > > deadlock reproduced by xfs/305.
> > > > 
> > > > Further details are included in the individual commit log descriptions.
> > > > Thoughts, reviews, flames appreciated.
> > > > 
> > > > Eryu,
> > > > 
> > > > I've run several hundred iterations of this on your reproducer system
> > > > without reproducing the hang. I have reproduced a reset overnight but
> > > > still haven't been able to grab a stack trace from that occurrence (I'll
> > > > try again today/tonight with better console logging). I suspect this is
> > > 
> > > I hit a NULL pointer dereference while testing your fix, I was running
> > > xfs/305 for 1000 iterations and host crashed at the 639th run. Not sure
> > > if it's the same issue you've met here. I posted dmesg log at the end of
> > > mail. I haven't tried to see if I can reproduce it on stock linus tree
> > > yet.
> > > 
> > 
> > Interesting, thanks. I don't know for sure because I didn't hit anything
> > on my second overnight run, but I wouldn't be surprised if it's the
> > same, particularly if you hit this again. This does look like an
> > independent problem to me, though. A kdump might be nice, if possible,
> > given the difficulty to reproduce...
> 
> Unfortunately, my second round of 1000 iteration run hit hang too, at
> the 824th loop.  Test configuration is all default, crc enabled XFS with
> 4k block size, no rmapbt no reflink no finobt no sparse inode.
> 
> I attached the dmesg log and sysrq-w output. I also left the host in the
> hang state, you can login and take a look if you have interest.
> 

Hmm, Ok thanks. This one looks more like the original problem.
Everything is waiting on log reservation, the AIL is spinning on the
locked quotaoff start log item, and the quotaoff purge sequence appears
to be spinning on a dquot.
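
The AIL spinning is expected in this state, btw: the quotaoff start
item's push handler can never make progress on the item, it just
reports it as locked until the quotaoff end item pulls it out of the
AIL, so xfsaild keeps retrying it. Roughly (from memory, so treat this
as a sketch rather than a copy of the code):

STATIC uint
xfs_qm_qoff_logitem_push(
        struct xfs_log_item     *lip,
        struct list_head        *buffer_list)
{
        return XFS_ITEM_LOCKED;
}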

Unfortunately, I can't tell why quotaoff is spinning. stap doesn't seem
to compile anything on this box after a quick try, so I'll probably have
to reinstall some of the debug code on top and (hopefully) reproduce.
I'm guessing it's a similar dquot reference count issue, but it may or
may not be the same since this one appears significantly harder to
reproduce than the original...

Brian

> Thanks,
> Eryu



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/5] xfs: quota deadlock fixes
  2017-02-20 13:25       ` Brian Foster
@ 2017-02-22 15:35         ` Brian Foster
  0 siblings, 0 replies; 23+ messages in thread
From: Brian Foster @ 2017-02-22 15:35 UTC (permalink / raw)
  To: Eryu Guan; +Cc: linux-xfs, Dave Chinner

On Mon, Feb 20, 2017 at 08:25:35AM -0500, Brian Foster wrote:
> On Mon, Feb 20, 2017 at 11:52:56AM +0800, Eryu Guan wrote:
> > On Fri, Feb 17, 2017 at 12:54:54PM -0500, Brian Foster wrote:
> > > On Fri, Feb 17, 2017 at 02:53:15PM +0800, Eryu Guan wrote:
> > > > On Wed, Feb 15, 2017 at 10:40:42AM -0500, Brian Foster wrote:
> > > > > Hi all,
> > > > > 
> > > > > This is a collection of several quota related deadlock fixes for
> > > > > problems that have been reported to the list recently.
> > > > > 
> > > > > Patch 1 fixes the low memory quotacheck problem reported by Martin[1].
> > > > > Dave is CC'd as he had comments on this particular thread that started a
> > > > > discussion, but I hadn't heard anything back since my last response.
> > > > > 
> > > > > Patch 2 fixes a separate problem I ran into while attempting to
> > > > > reproduce Eryu's xfs/305 hang report[2]. 
> > > > > 
> > > > > Patches 3-5 fix the actual problem reported by Eryu, which is a quotaoff
> > > > > deadlock reproduced by xfs/305.
> > > > > 
> > > > > Further details are included in the individual commit log descriptions.
> > > > > Thoughts, reviews, flames appreciated.
> > > > > 
> > > > > Eryu,
> > > > > 
> > > > > I've run several hundred iterations of this on your reproducer system
> > > > > without reproducing the hang. I have reproduced a reset overnight but
> > > > > still haven't been able to grab a stack trace from that occurrence (I'll
> > > > > try again today/tonight with better console logging). I suspect this is
> > > > 
> > > > I hit a NULL pointer dereference while testing your fix, I was running
> > > > xfs/305 for 1000 iterations and host crashed at the 639th run. Not sure
> > > > if it's the same issue you've met here. I posted dmesg log at the end of
> > > > mail. I haven't tried to see if I can reproduce it on stock linus tree
> > > > yet.
> > > > 
> > > 
> > > Interesting, thanks. I don't know for sure because I didn't hit anything
> > > on my second overnight run, but I wouldn't be surprised if it's the
> > > same, particularly if you hit this again. This does look like an
> > > independent problem to me, though. A kdump might be nice, if possible,
> > > given the difficulty to reproduce...
> > 
> > Unfortunately, my second round of 1000 iteration run hit hang too, at
> > the 824th loop.  Test configuration is all default, crc enabled XFS with
> > 4k block size, no rmapbt no reflink no finobt no sparse inode.
> > 
> > I attached the dmesg log and sysrq-w output. I also left the host in the
> > hang state, you can login and take a look if you have interest.
> > 
> 
> Hmm, Ok thanks. This one looks more like the original problem.
> Everything is waiting on log reservation, the AIL is spinning on the
> locked quotaoff start log item, and the quotaoff purge sequence appears
> to be spinning on a dquot.
> 
> Unfortunately, I can't tell why quotaoff is spinning. stap doesn't seem
> to compile anything on this box after a quick try, so I'll probably have
> to reinstall some of the debug code on top and (hopefully) reproduce.
> I'm guessing it's a similar dquot reference count issue, but it may or
> may not be the same since this one appears significantly harder to
> reproduce than the original...
> 

I managed to reproduce with some of my old debug code. That code enabled
a walk of the inode space in the fs to see if any inodes still held
references to the dquot we're unable to purge. In this case, it looks
like we have a group dquot with an elevated reference count, but no
inode appears to have a reference to it. So somehow or another the
reference count appears to be broken...

I'm running again with more tracing to hopefully try and see if the
refcounting goes awry somehow. So far I'm unable to reproduce in some
~1200 iterations, but I'll leave it running. FWIW, I think this is
enough to say that this problem is independent from the one addressed by
the last few patches in this series (in which the dquot was legitimately
held by an inode by the time we attempt the purge).

Brian

> Brian
> 
> > Thanks,
> > Eryu
> 
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/5] xfs: allocate quotaoff transactions up front to avoid log deadlock
  2017-02-15 15:40 ` [PATCH 2/5] xfs: allocate quotaoff transactions up front to avoid log deadlock Brian Foster
@ 2017-04-26 21:23   ` Darrick J. Wong
  2017-04-27 12:03     ` Brian Foster
  0 siblings, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2017-04-26 21:23 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, Eryu Guan, Dave Chinner

On Wed, Feb 15, 2017 at 10:40:44AM -0500, Brian Foster wrote:
> The quotaoff operation commits two explicitly synchronous
> transactions to correctly handle log recovery of dquots being
> modified at the time the quotaoff occurs. The first quotaoff
> transaction pins the tail of the log with a qoff logitem in the AIL
> to ensure further logged dquots stay ahead of the quotaoff operation
> in the log. The second quotaoff_end transaction is committed after
> the quotaoff operation completes, releases the original log item and
> thus unpins the log.
> 
> Since the first transaction pins the tail of the log, this means a
> finite amount of space in the log is available between the time a
> quotaoff starts and completes before transaction reservations have
> to block on the log. While the quotaoff operation itself consumes no
> further log space, it is possible for other operations in the
> filesystem to consume the remaining log space before the quotaoff
> completes. If this occurs, the quotaoff_end transaction also blocks
> on the log which prevents the release of the log item and the
> filesystem deadlocks. This has been reproduced via repeated xfs/305
> iterations on a vm with fairly limited resources.
> 
> To avoid a deadlock due to a particularly slow quotaoff operation,
> allocate the quotaoff_end transaction immediately after the initial
> quotaoff transaction is committed. Carry a reference to the
> transaction through xfs_qm_scall_quotaoff() rather than the qoff log
> item and commit it once the quotaoff completes.

Hmm... so instead of holding a pinned log item for however long it takes
to finish quotaoff, we hold an open transaction.  That does fix the
problem where we pin the log tail and the head crashes into it, but also
comes with its own problems, doesn't it?

In xfs_qm_scall_quotaoff, we call xfs_qm_dqrele_all_inodes, which
calls xfs_dqrele_inode to call xfs_qm_dqrele on all the inodes in the
system.  However, if any of those inodes are unlinked and inactive, the
iput -> xfs_fs_destroy_inode -> xfs_inactive path truncates the inode,
which itself requires a transaction, so now we're running with nested
transactions.

I'm not sure that's necessarily a problem because the outer transaction
contains only a single qoffend item.  AFAICT I haven't encountered any
other places in XFS where a single thread runs nested transactions, but
it still seems suspect to me.
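
Roughly, the ordering with this patch applied becomes (paraphrasing
the call chain, not real code):

        xfs_qm_scall_quotaoff()
          xfs_qm_log_quotaoff()
            xfs_trans_alloc(tr_qm_quotaoff)    /* start item, committed sync */
            xfs_trans_alloc(tr_qm_equotaoff)   /* end transaction held open */
          xfs_qm_dqrele_all_inodes()
            ... -> iput() on an unlinked, inactive inode
              xfs_fs_destroy_inode() -> xfs_inactive()
                xfs_trans_alloc()              /* nested freeze protection */
          xfs_trans_commit(qend_tp)            /* quotaoff end finally logged */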

Heh, lockdep just spat this at me:

[19767.048839] =============================================
[19767.049699] [ INFO: possible recursive locking detected ]
[19767.050511] 4.11.0-rc4-djw #3 Not tainted
[19767.051122] ---------------------------------------------
[19767.051907] xfs_quota/4152 is trying to acquire lock:
[19767.052617]  (sb_internal){.+.+.+}, at: [<ffffffffa015ee13>] xfs_trans_alloc+0xe3/0x130 [xfs]
[19767.054151] 
[19767.054151] but task is already holding lock:
[19767.055246]  (sb_internal){.+.+.+}, at: [<ffffffffa015ee13>] xfs_trans_alloc+0xe3/0x130 [xfs]
[19767.056897] 
[19767.056897] other info that might help us debug this:
[19767.057980]  Possible unsafe locking scenario:
[19767.057980] 
[19767.058967]        CPU0
[19767.059385]        ----
[19767.060339]   lock(sb_internal);
[19767.061187]   lock(sb_internal);
[19767.061771] 
[19767.061771]  *** DEADLOCK ***
[19767.061771] 
[19767.062818]  May be due to missing lock nesting notation
[19767.062818] 
[19767.063967] 3 locks held by xfs_quota/4152:
[19767.064521]  #0:  (&type->s_umount_key#34){++++++}, at: [<ffffffff812287f0>] __get_super.part.18+0xc0/0xe0
[19767.065805]  #1:  (&qinf->qi_quotaofflock){+.+.+.}, at: [<ffffffffa0182a53>] xfs_qm_scall_quotaoff+0x53/0x6c0 [xfs]
[19767.067226]  #2:  (sb_internal){.+.+.+}, at: [<ffffffffa015ee13>] xfs_trans_alloc+0xe3/0x130 [xfs]
[19767.068543] 
[19767.068543] stack backtrace:
[19767.069191] CPU: 2 PID: 4152 Comm: xfs_quota Not tainted 4.11.0-rc4-djw #3
[19767.070124] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[19767.071405] Call Trace:
[19767.071798]  dump_stack+0x85/0xc2
[19767.072298]  __lock_acquire+0x164e/0x16a0
[19767.072866]  ? kvm_sched_clock_read+0x9/0x20
[19767.073460]  ? sched_clock+0x9/0x10
[19767.073979]  lock_acquire+0xbf/0x210
[19767.074532]  ? xfs_trans_alloc+0xe3/0x130 [xfs]
[19767.075166]  __sb_start_write+0xc0/0x220
[19767.075782]  ? xfs_trans_alloc+0xe3/0x130 [xfs]
[19767.076475]  xfs_trans_alloc+0xe3/0x130 [xfs]
[19767.077138]  xfs_inactive_truncate+0x31/0x130 [xfs]
[19767.077862]  ? xfs_qm_dqattach+0x16/0x50 [xfs]
[19767.078524]  xfs_inactive+0x17d/0x270 [xfs]
[19767.079141]  xfs_fs_destroy_inode+0xb3/0x350 [xfs]
[19767.079964]  destroy_inode+0x3b/0x60
[19767.080679]  evict+0x132/0x190
[19767.081301]  iput+0x243/0x320
[19767.081961]  xfs_inode_ag_walk+0x2a8/0x690 [xfs]
[19767.082757]  ? xfs_inode_ag_walk+0x92/0x690 [xfs]
[19767.083545]  ? xfs_trans_free_dqinfo+0x30/0x30 [xfs]
[19767.084279]  ? xfs_perag_get+0xa8/0x2b0 [xfs]
[19767.084753]  ? xfs_perag_get+0xa8/0x2b0 [xfs]
[19767.085234]  xfs_inode_ag_iterator_flags+0x69/0xa0 [xfs]
[19767.086251]  ? xfs_trans_free_dqinfo+0x30/0x30 [xfs]
[19767.087219]  xfs_qm_dqrele_all_inodes+0x36/0x60 [xfs]
[19767.088431]  xfs_qm_scall_quotaoff+0x100/0x6c0 [xfs]
[19767.089285]  xfs_quota_disable+0x3d/0x50 [xfs]
[19767.090138]  SyS_quotactl+0x3ff/0x820
[19767.090853]  ? SYSC_newstat+0x3b/0x50
[19767.091501]  ? trace_hardirqs_on_caller+0x129/0x1b0
[19767.092398]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[19767.093238]  entry_SYSCALL_64_fastpath+0x1f/0xc2
[19767.094150] RIP: 0033:0x7f43530a70ca
[19767.094851] RSP: 002b:00007ffe7570fbe8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b3
[19767.096078] RAX: ffffffffffffffda RBX: 00007ffe7570fde8 RCX: 00007f43530a70ca
[19767.097242] RDX: 0000000000000000 RSI: 00000000012d26f0 RDI: 0000000000580202
[19767.098402] RBP: 0000000000000005 R08: 00007ffe7570fbfc R09: 0000000000005802
[19767.099645] R10: 00007ffe7570fbfc R11: 0000000000000206 R12: 0000000000401e50
[19767.100841] R13: 00007ffe7570fde0 R14: 0000000000000000 R15: 0000000000000000

--D

> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_qm_syscalls.c | 42 ++++++++++++++++++++++++------------------
>  1 file changed, 24 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
> index 475a388..dbb6802 100644
> --- a/fs/xfs/xfs_qm_syscalls.c
> +++ b/fs/xfs/xfs_qm_syscalls.c
> @@ -35,9 +35,11 @@
>  #include "xfs_trace.h"
>  #include "xfs_icache.h"
>  
> -STATIC int	xfs_qm_log_quotaoff(xfs_mount_t *, xfs_qoff_logitem_t **, uint);
> -STATIC int	xfs_qm_log_quotaoff_end(xfs_mount_t *, xfs_qoff_logitem_t *,
> -					uint);
> +STATIC int	xfs_qm_log_quotaoff(struct xfs_mount *, struct xfs_trans **,
> +				    uint);
> +STATIC int	xfs_qm_log_quotaoff_end(struct xfs_mount *,
> +					struct xfs_qoff_logitem *,
> +					struct xfs_trans **, uint);
>  
>  /*
>   * Turn off quota accounting and/or enforcement for all udquots and/or
> @@ -56,7 +58,7 @@ xfs_qm_scall_quotaoff(
>  	uint			dqtype;
>  	int			error;
>  	uint			inactivate_flags;
> -	xfs_qoff_logitem_t	*qoffstart;
> +	struct xfs_trans	*qend_tp;
>  
>  	/*
>  	 * No file system can have quotas enabled on disk but not in core.
> @@ -128,7 +130,7 @@ xfs_qm_scall_quotaoff(
>  	 * and synchronously. If we fail to write, we should abort the
>  	 * operation as it cannot be recovered safely if we crash.
>  	 */
> -	error = xfs_qm_log_quotaoff(mp, &qoffstart, flags);
> +	error = xfs_qm_log_quotaoff(mp, &qend_tp, flags);
>  	if (error)
>  		goto out_unlock;
>  
> @@ -181,7 +183,7 @@ xfs_qm_scall_quotaoff(
>  	 * So, we have QUOTAOFF start and end logitems; the start
>  	 * logitem won't get overwritten until the end logitem appears...
>  	 */
> -	error = xfs_qm_log_quotaoff_end(mp, qoffstart, flags);
> +	error = xfs_trans_commit(qend_tp);
>  	if (error) {
>  		/* We're screwed now. Shutdown is the only option. */
>  		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> @@ -556,13 +558,14 @@ xfs_qm_scall_setqlim(
>  
>  STATIC int
>  xfs_qm_log_quotaoff_end(
> -	xfs_mount_t		*mp,
> -	xfs_qoff_logitem_t	*startqoff,
> +	struct xfs_mount	*mp,
> +	struct xfs_qoff_logitem	*startqoff,
> +	struct xfs_trans	**tpp,
>  	uint			flags)
>  {
> -	xfs_trans_t		*tp;
> +	struct xfs_trans	*tp;
>  	int			error;
> -	xfs_qoff_logitem_t	*qoffi;
> +	struct xfs_qoff_logitem	*qoffi;
>  
>  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_equotaoff, 0, 0, 0, &tp);
>  	if (error)
> @@ -578,21 +581,22 @@ xfs_qm_log_quotaoff_end(
>  	 * We don't care about quotoff's performance.
>  	 */
>  	xfs_trans_set_sync(tp);
> -	return xfs_trans_commit(tp);
> +	*tpp = tp;
> +	return 0;
>  }
>  
>  
>  STATIC int
>  xfs_qm_log_quotaoff(
> -	xfs_mount_t	       *mp,
> -	xfs_qoff_logitem_t     **qoffstartp,
> -	uint		       flags)
> +	struct xfs_mount	*mp,
> +	struct xfs_trans	**end_tp,
> +	uint			flags)
>  {
> -	xfs_trans_t	       *tp;
> +	struct xfs_trans	*tp;
>  	int			error;
> -	xfs_qoff_logitem_t     *qoffi;
> +	struct xfs_qoff_logitem	*qoffi;
>  
> -	*qoffstartp = NULL;
> +	*end_tp = NULL;
>  
>  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
>  	if (error)
> @@ -617,7 +621,9 @@ xfs_qm_log_quotaoff(
>  	if (error)
>  		goto out;
>  
> -	*qoffstartp = qoffi;
> +	error = xfs_qm_log_quotaoff_end(mp, qoffi, end_tp, flags);
> +	if (error)
> +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
>  out:
>  	return error;
>  }
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/5] xfs: allocate quotaoff transactions up front to avoid log deadlock
  2017-04-26 21:23   ` Darrick J. Wong
@ 2017-04-27 12:03     ` Brian Foster
  2017-04-27 15:47       ` Darrick J. Wong
  0 siblings, 1 reply; 23+ messages in thread
From: Brian Foster @ 2017-04-27 12:03 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, Eryu Guan, Dave Chinner

On Wed, Apr 26, 2017 at 02:23:33PM -0700, Darrick J. Wong wrote:
> On Wed, Feb 15, 2017 at 10:40:44AM -0500, Brian Foster wrote:
> > The quotaoff operation commits two explicitly synchronous
> > transactions to correctly handle log recovery of dquots being
> > modified at the time the quotaoff occurs. The first quotaoff
> > transaction pins the tail of the log with a qoff logitem in the AIL
> > to ensure further logged dquots stay ahead of the quotaoff operation
> > in the log. The second quotaoff_end transaction is committed after
> > the quotaoff operation completes, releases the original log item and
> > thus unpins the log.
> > 
> > Since the first transaction pins the tail of the log, this means a
> > finite amount of space in the log is available between the time a
> > quotaoff starts and completes before transaction reservations have
> > to block on the log. While the quotaoff operation itself consumes no
> > further log space, it is possible for other operations in the
> > filesystem to consume the remaining log space before the quotaoff
> > completes. If this occurs, the quotaoff_end transaction also blocks
> > on the log which prevents the release of the log item and the
> > filesystem deadlocks. This has been reproduced via repeated xfs/305
> > iterations on a vm with fairly limited resources.
> > 
> > To avoid a deadlock due to a particularly slow quotaoff operation,
> > allocate the quotaoff_end transaction immediately after the initial
> > quotaoff transaction is committed. Carry a reference to the
> > transaction through xfs_qm_scall_quotaoff() rather than the qoff log
> > item and commit it once the quotaoff completes.
> 
> Hmm... so instead of holding a pinned log item for however long it takes
> to finish quotaoff, we instead hold a transaction.  That does fix the
> problem where we pin the log tail and the head crashes into it, but also
> comes with its own problems, doesn't it?
> 
> In xfs_qm_scall_quotaoff, we call xfs_qm_dqrele_all_inodes, which
> calls xfs_dqrele_inode to call xfs_qm_dqrele on all the inodes in the
> system.  However, if any of those inodes are unlinked and inactive, the
> iput -> xfs_fs_destroy_inode -> xfs_inactive path truncates the inode,
> which itself requires a transaction, so now we're running with nested
> transactions.
> 
> I'm not sure that's necessarily a problem because the outer transaction
> contains only a single qoffend item.  AFAICT I haven't encountered any
> other places in XFS where a single thread runs nested transactions, but
> it still seems suspect to me.
> 

Yeah, good spot. I wasn't expecting to be logging anything else in this
window of time, but it looks like that is possible. We definitely don't
want to be doing nested transactions. I'll have to think about this one
some more.

Note that this was a one-off fix for something I ran into while trying
to reproduce the livelock addressed by the subsequent patches in the
series. It's not a dependency for those last 3 patches.

> Heh, lockdep just spat this at me:
> 

I'm guessing I just didn't have lockdep enabled when I tested this. To
be sure... what is the reproducer for this? Thanks.

Brian

> [19767.048839] =============================================
> [19767.049699] [ INFO: possible recursive locking detected ]
> [19767.050511] 4.11.0-rc4-djw #3 Not tainted
> [19767.051122] ---------------------------------------------
> [19767.051907] xfs_quota/4152 is trying to acquire lock:
> [19767.052617]  (sb_internal){.+.+.+}, at: [<ffffffffa015ee13>] xfs_trans_alloc+0xe3/0x130 [xfs]
> [19767.054151] 
> [19767.054151] but task is already holding lock:
> [19767.055246]  (sb_internal){.+.+.+}, at: [<ffffffffa015ee13>] xfs_trans_alloc+0xe3/0x130 [xfs]
> [19767.056897] 
> [19767.056897] other info that might help us debug this:
> [19767.057980]  Possible unsafe locking scenario:
> [19767.057980] 
> [19767.058967]        CPU0
> [19767.059385]        ----
> [19767.060339]   lock(sb_internal);
> [19767.061187]   lock(sb_internal);
> [19767.061771] 
> [19767.061771]  *** DEADLOCK ***
> [19767.061771] 
> [19767.062818]  May be due to missing lock nesting notation
> [19767.062818] 
> [19767.063967] 3 locks held by xfs_quota/4152:
> [19767.064521]  #0:  (&type->s_umount_key#34){++++++}, at: [<ffffffff812287f0>] __get_super.part.18+0xc0/0xe0
> [19767.065805]  #1:  (&qinf->qi_quotaofflock){+.+.+.}, at: [<ffffffffa0182a53>] xfs_qm_scall_quotaoff+0x53/0x6c0 [xfs]
> [19767.067226]  #2:  (sb_internal){.+.+.+}, at: [<ffffffffa015ee13>] xfs_trans_alloc+0xe3/0x130 [xfs]
> [19767.068543] 
> [19767.068543] stack backtrace:
> [19767.069191] CPU: 2 PID: 4152 Comm: xfs_quota Not tainted 4.11.0-rc4-djw #3
> [19767.070124] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> [19767.071405] Call Trace:
> [19767.071798]  dump_stack+0x85/0xc2
> [19767.072298]  __lock_acquire+0x164e/0x16a0
> [19767.072866]  ? kvm_sched_clock_read+0x9/0x20
> [19767.073460]  ? sched_clock+0x9/0x10
> [19767.073979]  lock_acquire+0xbf/0x210
> [19767.074532]  ? xfs_trans_alloc+0xe3/0x130 [xfs]
> [19767.075166]  __sb_start_write+0xc0/0x220
> [19767.075782]  ? xfs_trans_alloc+0xe3/0x130 [xfs]
> [19767.076475]  xfs_trans_alloc+0xe3/0x130 [xfs]
> [19767.077138]  xfs_inactive_truncate+0x31/0x130 [xfs]
> [19767.077862]  ? xfs_qm_dqattach+0x16/0x50 [xfs]
> [19767.078524]  xfs_inactive+0x17d/0x270 [xfs]
> [19767.079141]  xfs_fs_destroy_inode+0xb3/0x350 [xfs]
> [19767.079964]  destroy_inode+0x3b/0x60
> [19767.080679]  evict+0x132/0x190
> [19767.081301]  iput+0x243/0x320
> [19767.081961]  xfs_inode_ag_walk+0x2a8/0x690 [xfs]
> [19767.082757]  ? xfs_inode_ag_walk+0x92/0x690 [xfs]
> [19767.083545]  ? xfs_trans_free_dqinfo+0x30/0x30 [xfs]
> [19767.084279]  ? xfs_perag_get+0xa8/0x2b0 [xfs]
> [19767.084753]  ? xfs_perag_get+0xa8/0x2b0 [xfs]
> [19767.085234]  xfs_inode_ag_iterator_flags+0x69/0xa0 [xfs]
> [19767.086251]  ? xfs_trans_free_dqinfo+0x30/0x30 [xfs]
> [19767.087219]  xfs_qm_dqrele_all_inodes+0x36/0x60 [xfs]
> [19767.088431]  xfs_qm_scall_quotaoff+0x100/0x6c0 [xfs]
> [19767.089285]  xfs_quota_disable+0x3d/0x50 [xfs]
> [19767.090138]  SyS_quotactl+0x3ff/0x820
> [19767.090853]  ? SYSC_newstat+0x3b/0x50
> [19767.091501]  ? trace_hardirqs_on_caller+0x129/0x1b0
> [19767.092398]  ? trace_hardirqs_on_thunk+0x1a/0x1c
> [19767.093238]  entry_SYSCALL_64_fastpath+0x1f/0xc2
> [19767.094150] RIP: 0033:0x7f43530a70ca
> [19767.094851] RSP: 002b:00007ffe7570fbe8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b3
> [19767.096078] RAX: ffffffffffffffda RBX: 00007ffe7570fde8 RCX: 00007f43530a70ca
> [19767.097242] RDX: 0000000000000000 RSI: 00000000012d26f0 RDI: 0000000000580202
> [19767.098402] RBP: 0000000000000005 R08: 00007ffe7570fbfc R09: 0000000000005802
> [19767.099645] R10: 00007ffe7570fbfc R11: 0000000000000206 R12: 0000000000401e50
> [19767.100841] R13: 00007ffe7570fde0 R14: 0000000000000000 R15: 0000000000000000
> 
> --D
> 
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/xfs_qm_syscalls.c | 42 ++++++++++++++++++++++++------------------
> >  1 file changed, 24 insertions(+), 18 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
> > index 475a388..dbb6802 100644
> > --- a/fs/xfs/xfs_qm_syscalls.c
> > +++ b/fs/xfs/xfs_qm_syscalls.c
> > @@ -35,9 +35,11 @@
> >  #include "xfs_trace.h"
> >  #include "xfs_icache.h"
> >  
> > -STATIC int	xfs_qm_log_quotaoff(xfs_mount_t *, xfs_qoff_logitem_t **, uint);
> > -STATIC int	xfs_qm_log_quotaoff_end(xfs_mount_t *, xfs_qoff_logitem_t *,
> > -					uint);
> > +STATIC int	xfs_qm_log_quotaoff(struct xfs_mount *, struct xfs_trans **,
> > +				    uint);
> > +STATIC int	xfs_qm_log_quotaoff_end(struct xfs_mount *,
> > +					struct xfs_qoff_logitem *,
> > +					struct xfs_trans **, uint);
> >  
> >  /*
> >   * Turn off quota accounting and/or enforcement for all udquots and/or
> > @@ -56,7 +58,7 @@ xfs_qm_scall_quotaoff(
> >  	uint			dqtype;
> >  	int			error;
> >  	uint			inactivate_flags;
> > -	xfs_qoff_logitem_t	*qoffstart;
> > +	struct xfs_trans	*qend_tp;
> >  
> >  	/*
> >  	 * No file system can have quotas enabled on disk but not in core.
> > @@ -128,7 +130,7 @@ xfs_qm_scall_quotaoff(
> >  	 * and synchronously. If we fail to write, we should abort the
> >  	 * operation as it cannot be recovered safely if we crash.
> >  	 */
> > -	error = xfs_qm_log_quotaoff(mp, &qoffstart, flags);
> > +	error = xfs_qm_log_quotaoff(mp, &qend_tp, flags);
> >  	if (error)
> >  		goto out_unlock;
> >  
> > @@ -181,7 +183,7 @@ xfs_qm_scall_quotaoff(
> >  	 * So, we have QUOTAOFF start and end logitems; the start
> >  	 * logitem won't get overwritten until the end logitem appears...
> >  	 */
> > -	error = xfs_qm_log_quotaoff_end(mp, qoffstart, flags);
> > +	error = xfs_trans_commit(qend_tp);
> >  	if (error) {
> >  		/* We're screwed now. Shutdown is the only option. */
> >  		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > @@ -556,13 +558,14 @@ xfs_qm_scall_setqlim(
> >  
> >  STATIC int
> >  xfs_qm_log_quotaoff_end(
> > -	xfs_mount_t		*mp,
> > -	xfs_qoff_logitem_t	*startqoff,
> > +	struct xfs_mount	*mp,
> > +	struct xfs_qoff_logitem	*startqoff,
> > +	struct xfs_trans	**tpp,
> >  	uint			flags)
> >  {
> > -	xfs_trans_t		*tp;
> > +	struct xfs_trans	*tp;
> >  	int			error;
> > -	xfs_qoff_logitem_t	*qoffi;
> > +	struct xfs_qoff_logitem	*qoffi;
> >  
> >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_equotaoff, 0, 0, 0, &tp);
> >  	if (error)
> > @@ -578,21 +581,22 @@ xfs_qm_log_quotaoff_end(
> >  	 * We don't care about quotoff's performance.
> >  	 */
> >  	xfs_trans_set_sync(tp);
> > -	return xfs_trans_commit(tp);
> > +	*tpp = tp;
> > +	return 0;
> >  }
> >  
> >  
> >  STATIC int
> >  xfs_qm_log_quotaoff(
> > -	xfs_mount_t	       *mp,
> > -	xfs_qoff_logitem_t     **qoffstartp,
> > -	uint		       flags)
> > +	struct xfs_mount	*mp,
> > +	struct xfs_trans	**end_tp,
> > +	uint			flags)
> >  {
> > -	xfs_trans_t	       *tp;
> > +	struct xfs_trans	*tp;
> >  	int			error;
> > -	xfs_qoff_logitem_t     *qoffi;
> > +	struct xfs_qoff_logitem	*qoffi;
> >  
> > -	*qoffstartp = NULL;
> > +	*end_tp = NULL;
> >  
> >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
> >  	if (error)
> > @@ -617,7 +621,9 @@ xfs_qm_log_quotaoff(
> >  	if (error)
> >  		goto out;
> >  
> > -	*qoffstartp = qoffi;
> > +	error = xfs_qm_log_quotaoff_end(mp, qoffi, end_tp, flags);
> > +	if (error)
> > +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> >  out:
> >  	return error;
> >  }
> > -- 
> > 2.7.4
> > 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/5] xfs: allocate quotaoff transactions up front to avoid log deadlock
  2017-04-27 12:03     ` Brian Foster
@ 2017-04-27 15:47       ` Darrick J. Wong
  0 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2017-04-27 15:47 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, Eryu Guan, Dave Chinner

On Thu, Apr 27, 2017 at 08:03:27AM -0400, Brian Foster wrote:
> On Wed, Apr 26, 2017 at 02:23:33PM -0700, Darrick J. Wong wrote:
> > On Wed, Feb 15, 2017 at 10:40:44AM -0500, Brian Foster wrote:
> > > The quotaoff operation commits two explicitly synchronous
> > > transactions to correctly handle log recovery of dquots being
> > > modified at the time the quotaoff occurs. The first quotaoff
> > > transaction pins the tail of the log with a qoff logitem in the AIL
> > > to ensure further logged dquots stay ahead of the quotaoff operation
> > > in the log. The second quotaoff_end transaction is committed after
> > > the quotaoff operation completes, releases the original log item and
> > > thus unpins the log.
> > > 
> > > Since the first transaction pins the tail of the log, this means a
> > > finite amount of space in the log is available between the time a
> > > quotaoff starts and completes before transaction reservations have
> > > to block on the log. While the quotaoff operation itself consumes no
> > > further log space, it is possible for other operations in the
> > > filesystem to consume the remaining log space before the quotaoff
> > > completes. If this occurs, the quotaoff_end transaction also blocks
> > > on the log which prevents the release of the log item and the
> > > filesystem deadlocks. This has been reproduced via repeated xfs/305
> > > iterations on a vm with fairly limited resources.
> > > 
> > > To avoid a deadlock due to a particularly slow quotaoff operation,
> > > allocate the quotaoff_end transaction immediately after the initial
> > > quotaoff transaction is committed. Carry a reference to the
> > > transaction through xfs_qm_scall_quotaoff() rather than the qoff log
> > > item and commit it once the quotaoff completes.
> > 
> > Hmm... so instead of holding a pinned log item for however long it takes
> > to finish quotaoff, we instead hold a transaction.  That does fix the
> > problem where we pin the log tail and the head crashes into it, but also
> > comes with its own problems, doesn't it?
> > 
> > In xfs_qm_scall_quotaoff, we call xfs_qm_dqrele_all_inodes, which
> > calls xfs_dqrele_inode to call xfs_qm_dqrele on all the inodes in the
> > system.  However, if any of those inodes are unlinked and inactive, the
> > iput -> xfs_fs_destroy_inode -> xfs_inactive path truncates the inode,
> > which itself requires a transaction, so now we're running with nested
> > transactions.
> > 
> > I'm not sure that's necessarily a problem because the outer transaction
> > contains only a single qoffend item.  AFAICT I haven't encountered any
> > other places in XFS where a single thread runs nested transactions, but
> > it still seems suspect to me.
> > 
> 
> Yeah, good spot. I wasn't expecting to be logging anything else in this
> window of time, but it looks like that is possible. We definitely don't
> want to be doing nested transactions. I'll have to think about this one
> some more.
> 
> Note that this was a one-off fix for something I reproduced when trying
> to reproduce the livelock addressed by the subsequent patches in the
> series. It's not a dependency for the subsequent 3 patches.

Oh, ok.  Thank you for pointing that out, as I didn't see a strong
connection between this patch and the next three, and was wondering if
there was one.
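
(For reference, a schematic of the ordering behind the nesting concern
above, assembled only from function names already quoted in this thread.
It is an illustrative fragment, not the actual xfs_qm_scall_quotaoff()
body; error handling and the other quotaoff steps are omitted, and mp
and flags stand in for the usual quotaoff arguments:)

	struct xfs_trans	*qend_tp;
	int			error;

	/*
	 * With patch 2, the quotaoff_end transaction, and with it the
	 * sb_internal freeze reference taken by xfs_trans_alloc(), is
	 * allocated up front and held...
	 */
	error = xfs_qm_log_quotaoff(mp, &qend_tp, flags);

	/*
	 * ...across the dqrele walk. If the walk drops the last
	 * reference to an unlinked inode, eviction runs
	 * xfs_inactive_truncate(), which calls xfs_trans_alloc() again,
	 * so the same task takes sb_internal twice. That is the
	 * recursion lockdep reports in this thread.
	 */
	xfs_qm_dqrele_all_inodes(mp, flags);

	/* The pre-allocated end transaction only commits afterwards. */
	error = xfs_trans_commit(qend_tp);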

> > Heh, lockdep just spat this at me:
> > 
> 
> I'm guessing I just didn't have lockdep enabled when I tested this. To
> be sure... what is the reproducer for this? Thanks.

Running xfs/305 in a loop.

--D
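
(Also for reference: the quotaoff side of xfs/305 enters the kernel
through the SyS_quotactl -> xfs_quota_disable frames visible in the
lockdep trace quoted in this thread. Below is a minimal standalone
sketch of that userspace request; the file name and usage are
hypothetical, but quotactl(), QCMD(), Q_XQUOTAOFF and the
XFS_QUOTA_UDQ_* flags are the standard interface from <sys/quota.h>
and <linux/dqblk_xfs.h>. It assumes an XFS block device with user
quota enabled and must run as root:)

/* q_xquotaoff.c: turn off XFS user quota accounting and enforcement
 * on a device, the same request xfs_quota's 'off' command sends.
 * Build with "cc -o q_xquotaoff q_xquotaoff.c". */
#include <stdio.h>
#include <sys/types.h>
#include <sys/quota.h>
#include <linux/dqblk_xfs.h>

int main(int argc, char **argv)
{
	unsigned int flags = XFS_QUOTA_UDQ_ACCT | XFS_QUOTA_UDQ_ENFD;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <xfs-block-device>\n", argv[0]);
		return 1;
	}

	/* Lands in xfs_quota_disable() -> xfs_qm_scall_quotaoff(). */
	if (quotactl(QCMD(Q_XQUOTAOFF, USRQUOTA), argv[1], 0,
		     (caddr_t)&flags) < 0) {
		perror("quotactl(Q_XQUOTAOFF)");
		return 1;
	}

	return 0;
}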

> 
> Brian
> 
> > [lockdep report snipped; it is quoted in full earlier in this thread]
> > 
> > --D
> > 
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/xfs/xfs_qm_syscalls.c | 42 ++++++++++++++++++++++++------------------
> > >  1 file changed, 24 insertions(+), 18 deletions(-)
> > > 
> > > [patch snipped; the full diff is quoted earlier in this thread]
> > > -- 
> > > 2.7.4
> > > 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/5] xfs: support ability to wait on new inodes
  2017-02-15 15:40 ` [PATCH 3/5] xfs: support ability to wait on new inodes Brian Foster
@ 2017-04-27 21:15   ` Darrick J. Wong
  0 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2017-04-27 21:15 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, Eryu Guan, Dave Chinner

On Wed, Feb 15, 2017 at 10:40:45AM -0500, Brian Foster wrote:
> Inodes that are inserted into the perag tree but still under
> construction are flagged with the XFS_INEW bit. Most contexts either
> skip such inodes when they are encountered or have the ability to
> handle them.
> 
> The runtime quotaoff sequence introduces a context that must wait
> for construction of such inodes to correctly ensure that all dquots
> in the fs are released. In anticipation of this, support the ability
> to wait on new inodes. Wake the appropriate bit when XFS_INEW is
> cleared.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_icache.c | 5 ++++-
>  fs/xfs/xfs_inode.h  | 4 +++-
>  2 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 7234b97..bb55fd7 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -366,14 +366,17 @@ xfs_iget_cache_hit(
>  
>  		error = xfs_reinit_inode(mp, inode);
>  		if (error) {
> +			bool wake;
>  			/*
>  			 * Re-initializing the inode failed, and we are in deep
>  			 * trouble.  Try to re-add it to the reclaim list.
>  			 */
>  			rcu_read_lock();
>  			spin_lock(&ip->i_flags_lock);
> -
> +			wake = !!__xfs_iflags_test(ip, XFS_INEW);
>  			ip->i_flags &= ~(XFS_INEW | XFS_IRECLAIM);
> +			if (wake)
> +				wake_up_bit(&ip->i_flags, __XFS_INEW_BIT);
>  			ASSERT(ip->i_flags & XFS_IRECLAIMABLE);
>  			trace_xfs_iget_reclaim_fail(ip);
>  			goto out_error;
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 10dcf27..10e89fc 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -216,7 +216,8 @@ static inline bool xfs_is_reflink_inode(struct xfs_inode *ip)
>  #define XFS_IRECLAIM		(1 << 0) /* started reclaiming this inode */
>  #define XFS_ISTALE		(1 << 1) /* inode has been staled */
>  #define XFS_IRECLAIMABLE	(1 << 2) /* inode can be reclaimed */
> -#define XFS_INEW		(1 << 3) /* inode has just been allocated */
> +#define __XFS_INEW_BIT		3	 /* inode has just been allocated */
> +#define XFS_INEW		(1 << __XFS_INEW_BIT)
>  #define XFS_ITRUNCATED		(1 << 5) /* truncated down so flush-on-close */
>  #define XFS_IDIRTY_RELEASE	(1 << 6) /* dirty release already seen */
>  #define __XFS_IFLOCK_BIT	7	 /* inode is being flushed right now */
> @@ -464,6 +465,7 @@ static inline void xfs_finish_inode_setup(struct xfs_inode *ip)
>  	xfs_iflags_clear(ip, XFS_INEW);
>  	barrier();
>  	unlock_new_inode(VFS_I(ip));
> +	wake_up_bit(&ip->i_flags, __XFS_INEW_BIT);
>  }
>  
>  static inline void xfs_setup_existing_inode(struct xfs_inode *ip)
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/5] xfs: update ag iterator to support wait on new inodes
  2017-02-15 15:40 ` [PATCH 4/5] xfs: update ag iterator to support " Brian Foster
@ 2017-04-27 21:17   ` Darrick J. Wong
  0 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2017-04-27 21:17 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, Eryu Guan, Dave Chinner

On Wed, Feb 15, 2017 at 10:40:46AM -0500, Brian Foster wrote:
> The AG inode iterator currently skips new inodes as such inodes are
> inserted into the inode radix tree before they are fully
> constructed. Certain contexts require the ability to wait on the
> construction of new inodes, however. The fs-wide dquot release from
> the quotaoff sequence is an example of this.
> 
> Update the AG inode iterator to support the ability to wait on
> inodes flagged with XFS_INEW upon request. Create a new
> xfs_inode_ag_iterator_flags() interface and support a set of
> iteration flags to modify the iteration behavior. When the
> XFS_AGITER_INEW_WAIT flag is set, include XFS_INEW flags in the
> radix tree inode lookup and wait on them before the callback is
> executed.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_icache.c | 53 +++++++++++++++++++++++++++++++++++++++++++++--------
>  fs/xfs/xfs_icache.h |  8 ++++++++
>  2 files changed, 53 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index bb55fd7..cb45a70 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -262,6 +262,22 @@ xfs_inode_clear_reclaim_tag(
>  	xfs_perag_clear_reclaim_tag(pag);
>  }
>  
> +static void
> +xfs_inew_wait(
> +	struct xfs_inode	*ip)
> +{
> +	wait_queue_head_t *wq = bit_waitqueue(&ip->i_flags, __XFS_INEW_BIT);
> +	DEFINE_WAIT_BIT(wait, &ip->i_flags, __XFS_INEW_BIT);
> +
> +	do {
> +		prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
> +		if (!xfs_iflags_test(ip, XFS_INEW))
> +			break;
> +		schedule();
> +	} while (true);
> +	finish_wait(wq, &wait.wait);
> +}
> +
>  /*
>   * When we recycle a reclaimable inode, we need to re-initialise the VFS inode
>   * part of the structure. This is made more complex by the fact we store
> @@ -626,9 +642,11 @@ xfs_iget(
>  
>  STATIC int
>  xfs_inode_ag_walk_grab(
> -	struct xfs_inode	*ip)
> +	struct xfs_inode	*ip,
> +	int			flags)
>  {
>  	struct inode		*inode = VFS_I(ip);
> +	bool			newinos = !!(flags & XFS_AGITER_INEW_WAIT);
>  
>  	ASSERT(rcu_read_lock_held());
>  
> @@ -646,7 +664,8 @@ xfs_inode_ag_walk_grab(
>  		goto out_unlock_noent;
>  
>  	/* avoid new or reclaimable inodes. Leave for reclaim code to flush */
> -	if (__xfs_iflags_test(ip, XFS_INEW | XFS_IRECLAIMABLE | XFS_IRECLAIM))
> +	if ((!newinos && __xfs_iflags_test(ip, XFS_INEW)) ||
> +	    __xfs_iflags_test(ip, XFS_IRECLAIMABLE | XFS_IRECLAIM))
>  		goto out_unlock_noent;
>  	spin_unlock(&ip->i_flags_lock);
>  
> @@ -674,7 +693,8 @@ xfs_inode_ag_walk(
>  					   void *args),
>  	int			flags,
>  	void			*args,
> -	int			tag)
> +	int			tag,
> +	int			iter_flags)
>  {
>  	uint32_t		first_index;
>  	int			last_error = 0;
> @@ -716,7 +736,7 @@ xfs_inode_ag_walk(
>  		for (i = 0; i < nr_found; i++) {
>  			struct xfs_inode *ip = batch[i];
>  
> -			if (done || xfs_inode_ag_walk_grab(ip))
> +			if (done || xfs_inode_ag_walk_grab(ip, iter_flags))
>  				batch[i] = NULL;
>  
>  			/*
> @@ -744,6 +764,9 @@ xfs_inode_ag_walk(
>  		for (i = 0; i < nr_found; i++) {
>  			if (!batch[i])
>  				continue;
> +			if ((iter_flags & XFS_AGITER_INEW_WAIT) &&
> +			    xfs_iflags_test(batch[i], XFS_INEW))
> +				xfs_inew_wait(batch[i]);
>  			error = execute(batch[i], flags, args);
>  			IRELE(batch[i]);
>  			if (error == -EAGAIN) {
> @@ -823,12 +846,13 @@ xfs_cowblocks_worker(
>  }
>  
>  int
> -xfs_inode_ag_iterator(
> +xfs_inode_ag_iterator_flags(
>  	struct xfs_mount	*mp,
>  	int			(*execute)(struct xfs_inode *ip, int flags,
>  					   void *args),
>  	int			flags,
> -	void			*args)
> +	void			*args,
> +	int			iter_flags)
>  {
>  	struct xfs_perag	*pag;
>  	int			error = 0;
> @@ -838,7 +862,8 @@ xfs_inode_ag_iterator(
>  	ag = 0;
>  	while ((pag = xfs_perag_get(mp, ag))) {
>  		ag = pag->pag_agno + 1;
> -		error = xfs_inode_ag_walk(mp, pag, execute, flags, args, -1);
> +		error = xfs_inode_ag_walk(mp, pag, execute, flags, args, -1,
> +					  iter_flags);
>  		xfs_perag_put(pag);
>  		if (error) {
>  			last_error = error;
> @@ -850,6 +875,17 @@ xfs_inode_ag_iterator(
>  }
>  
>  int
> +xfs_inode_ag_iterator(
> +	struct xfs_mount	*mp,
> +	int			(*execute)(struct xfs_inode *ip, int flags,
> +					   void *args),
> +	int			flags,
> +	void			*args)
> +{
> +	return xfs_inode_ag_iterator_flags(mp, execute, flags, args, 0);
> +}
> +
> +int
>  xfs_inode_ag_iterator_tag(
>  	struct xfs_mount	*mp,
>  	int			(*execute)(struct xfs_inode *ip, int flags,
> @@ -866,7 +902,8 @@ xfs_inode_ag_iterator_tag(
>  	ag = 0;
>  	while ((pag = xfs_perag_get_tag(mp, ag, tag))) {
>  		ag = pag->pag_agno + 1;
> -		error = xfs_inode_ag_walk(mp, pag, execute, flags, args, tag);
> +		error = xfs_inode_ag_walk(mp, pag, execute, flags, args, tag,
> +					  0);
>  		xfs_perag_put(pag);
>  		if (error) {
>  			last_error = error;
> diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
> index 8a7c849..9183f77 100644
> --- a/fs/xfs/xfs_icache.h
> +++ b/fs/xfs/xfs_icache.h
> @@ -48,6 +48,11 @@ struct xfs_eofblocks {
>  #define XFS_IGET_UNTRUSTED	0x2
>  #define XFS_IGET_DONTCACHE	0x4
>  
> +/*
> + * flags for AG inode iterator
> + */
> +#define XFS_AGITER_INEW_WAIT	0x1	/* wait on new inodes */
> +
>  int xfs_iget(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t ino,
>  	     uint flags, uint lock_flags, xfs_inode_t **ipp);
>  
> @@ -79,6 +84,9 @@ void xfs_cowblocks_worker(struct work_struct *);
>  int xfs_inode_ag_iterator(struct xfs_mount *mp,
>  	int (*execute)(struct xfs_inode *ip, int flags, void *args),
>  	int flags, void *args);
> +int xfs_inode_ag_iterator_flags(struct xfs_mount *mp,
> +	int (*execute)(struct xfs_inode *ip, int flags, void *args),
> +	int flags, void *args, int iter_flags);
>  int xfs_inode_ag_iterator_tag(struct xfs_mount *mp,
>  	int (*execute)(struct xfs_inode *ip, int flags, void *args),
>  	int flags, void *args, int tag);
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 5/5] xfs: wait on new inodes during quotaoff dquot release
  2017-02-15 15:40 ` [PATCH 5/5] xfs: wait on new inodes during quotaoff dquot release Brian Foster
@ 2017-04-27 21:17   ` Darrick J. Wong
  0 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2017-04-27 21:17 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, Eryu Guan, Dave Chinner

On Wed, Feb 15, 2017 at 10:40:47AM -0500, Brian Foster wrote:
> The quotaoff operation has a race with inode allocation that results
> in a livelock. An inode allocation that occurs before the quota
> status flags are updated acquires the appropriate dquots for the
> inode via xfs_qm_vop_dqalloc(). It then inserts the XFS_INEW inode
> into the perag radix tree, sometime later attaches the dquots to the
> inode and finally clears the XFS_INEW flag. Quotaoff expects to
> release the dquots from all inodes in the filesystem via
> xfs_qm_dqrele_all_inodes(). This invokes the AG inode iterator,
> which skips inodes in the XFS_INEW state because they are not fully
> constructed. If the scan occurs after dquots have been attached to
> an inode, but before XFS_INEW is cleared, the newly allocated inode
> will continue to hold a reference to the applicable dquots. When
> quotaoff invokes xfs_qm_dqpurge_all(), the reference count of those
> dquot(s) remain elevated and the dqpurge scan spins indefinitely.
> 
> To address this problem, update the xfs_qm_dqrele_all_inodes() scan
> to wait on inodes in the XFS_INEW state. We wait on the
> inodes explicitly rather than skip and retry to avoid continuous
> retry loops due to a parallel inode allocation workload. Since
> quotaoff updates the quota state flags and uses a synchronous
> transaction before the dqrele scan, and dquots are attached to
> inodes after radix tree insertion iff quota is enabled, one INEW
> waiting pass through the AG guarantees that the scan has processed
> all inodes that could possibly hold dquot references.
> 
> Reported-by: Eryu Guan <eguan@redhat.com>
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_qm_syscalls.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
> index dbb6802..40b7c3f 100644
> --- a/fs/xfs/xfs_qm_syscalls.c
> +++ b/fs/xfs/xfs_qm_syscalls.c
> @@ -765,5 +765,6 @@ xfs_qm_dqrele_all_inodes(
>  	uint		 flags)
>  {
>  	ASSERT(mp->m_quotainfo);
> -	xfs_inode_ag_iterator(mp, xfs_dqrele_inode, flags, NULL);
> +	xfs_inode_ag_iterator_flags(mp, xfs_dqrele_inode, flags, NULL,
> +				    XFS_AGITER_INEW_WAIT);
>  }
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread

Thread overview: 23+ messages
2017-02-15 15:40 [PATCH 0/5] xfs: quota deadlock fixes Brian Foster
2017-02-15 15:40 ` [PATCH 1/5] xfs: bypass dquot reclaim to avoid quotacheck deadlock Brian Foster
2017-02-16 22:37   ` Dave Chinner
2017-02-17 18:30     ` Brian Foster
2017-02-17 23:12       ` Dave Chinner
2017-02-18 12:55         ` Brian Foster
2017-02-15 15:40 ` [PATCH 2/5] xfs: allocate quotaoff transactions up front to avoid log deadlock Brian Foster
2017-04-26 21:23   ` Darrick J. Wong
2017-04-27 12:03     ` Brian Foster
2017-04-27 15:47       ` Darrick J. Wong
2017-02-15 15:40 ` [PATCH 3/5] xfs: support ability to wait on new inodes Brian Foster
2017-04-27 21:15   ` Darrick J. Wong
2017-02-15 15:40 ` [PATCH 4/5] xfs: update ag iterator to support " Brian Foster
2017-04-27 21:17   ` Darrick J. Wong
2017-02-15 15:40 ` [PATCH 5/5] xfs: wait on new inodes during quotaoff dquot release Brian Foster
2017-04-27 21:17   ` Darrick J. Wong
2017-02-16  7:42 ` [PATCH 0/5] xfs: quota deadlock fixes Eryu Guan
2017-02-16 12:01   ` Brian Foster
2017-02-17  6:53 ` Eryu Guan
2017-02-17 17:54   ` Brian Foster
2017-02-20  3:52     ` Eryu Guan
2017-02-20 13:25       ` Brian Foster
2017-02-22 15:35         ` Brian Foster
