All of lore.kernel.org
 help / color / mirror / Atom feed
From: James Simmons <jsimmons@infradead.org>
To: lustre-devel@lists.lustre.org
Subject: [lustre-devel] [PATCH 20/42] lnet: don't read debugfs lnet stats when shutting down
Date: Mon,  5 Oct 2020 20:05:59 -0400	[thread overview]
Message-ID: <1601942781-24950-21-git-send-email-jsimmons@infradead.org> (raw)
In-Reply-To: <1601942781-24950-1-git-send-email-jsimmons@infradead.org>

A race exist on shutdown with an external application reading
the debugfs file containing lnet stats which causes an kernel
crash.

[  257.192117] BUG: unable to handle kernel paging request at fffffffffffffff0
[  257.194859] IP: [<ffffffffc0bb95c6>] cfs_percpt_number+0x6/0x10 [libcfs]
[  257.196863] PGD 7c14067 PUD 7c16067 PMD 0
[  257.198665] Oops: 0000 [#1] SMP
[  257.200431] Modules linked in: ksocklnd(OE) lnet(OE) libcfs(OE) dm_service_time iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) ppdev iosf_mbi crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sg e1000 video parport_pc parport i2c_piix4 dm_multipath dm_mod ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic ata_generic pata_acpi crct10dif_pclmul crct10dif_common ata_piix serio_raw libata [last unloaded: obdclass]
[  257.222895] CPU: 0 PID: 7331 Comm: lctl Tainted: P           OE  ------------   3.10.0-957.el7_lustre.x86_64 #1
[  257.229312] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  257.233659] task: ffff9c9fbaf15140 ti: ffff9c9fbabcc000 task.ti: ffff9c9fbabcc000
[  257.238388] RIP: 0010:[<ffffffffc0bb95c6>]  [<ffffffffc0bb95c6>] cfs_percpt_number+0x6/0x10 [libcfs]
[  257.243851] RSP: 0018:ffff9c9fbabcfdb0  EFLAGS: 00010296
[  257.246400] RAX: 0000000000000000 RBX: ffff9c9fba2a5200 RCX: 0000000000000000
[  257.250304] RDX: 0000000000000001 RSI: 00000000ffffffff RDI: 0000000000000000
[  257.253677] RBP: ffff9c9fbabcfdd0 R08: 000000000001f120 R09: ffff9c9fbe001700
[  257.257073] R10: ffffffffc0c376db R11: 0000000000000246 R12: 0000000000000000
[  257.260339] R13: 0000000000000000 R14: 0000000000001000 R15: ffff9c9fba2a5200
[  257.263204] FS:  00007fbdc89c6740(0000) GS:ffff9c9fbfc00000(0000) knlGS:0000000000000000
[  257.266409] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  257.269105] CR2: fffffffffffffff0 CR3: 0000000022e36000 CR4: 00000000000606f0
[  257.272529] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  257.275209] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  257.277936] Call Trace:
[  257.279245]  [<ffffffffc0c0a88b>] ? lnet_counters_get_common+0xeb/0x150 [lnet]
[  257.283071]  [<ffffffffc0c0a95c>] lnet_counters_get+0x6c/0x150 [lnet]
[  257.286224]  [<ffffffffc0c3771b>] __proc_lnet_stats+0xfb/0x810 [lnet]
[  257.288975]  [<ffffffffc0ba6602>] lprocfs_call_handler+0x22/0x50 [libcfs]
[  257.292387]  [<ffffffffc0c36bf5>] proc_lnet_stats+0x25/0x30 [lnet]
[  257.295184]  [<ffffffffc0ba665d>] lnet_debugfs_read+0x2d/0x40 [libcfs]

The solution is to only allow reading of the lnet stats when the
lnet state is LNET_STATE_RUNNING.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11986
Lustre-commit: f53eea15d470c9 ("LU-11986 lnet: don't read debugfs lnet stats when shutting down")
Signed-off-by: James Simmons <jsimmons@infradead.org>
Reviewed-on: https://review.whamcloud.com/39404
Reviewed-by: Chris Horn <chris.horn@hpe.com>
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: Nathaniel Clark <nclark@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
---
 include/linux/lnet/lib-lnet.h |  2 +-
 net/lnet/lnet/api-ni.c        | 40 +++++++++++++++++++++++++++++-----------
 net/lnet/lnet/router_proc.c   | 17 ++++++-----------
 3 files changed, 36 insertions(+), 23 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index 6a9ea10..6253c16 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -664,7 +664,7 @@ int lnet_delay_rule_list(int pos, struct lnet_fault_attr *attr,
 /** @} lnet_fault_simulation */
 
 void lnet_counters_get_common(struct lnet_counters_common *common);
-void lnet_counters_get(struct lnet_counters *counters);
+int lnet_counters_get(struct lnet_counters *counters);
 void lnet_counters_reset(void);
 
 unsigned int lnet_iov_nob(unsigned int niov, struct kvec *iov);
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index 0f325ec..af2089e 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -853,16 +853,17 @@ static void lnet_assert_wire_constants(void)
 }
 EXPORT_SYMBOL(lnet_unregister_lnd);
 
-void
-lnet_counters_get_common(struct lnet_counters_common *common)
+static void
+lnet_counters_get_common_locked(struct lnet_counters_common *common)
 {
 	struct lnet_counters *ctr;
 	int i;
 
+	/* FIXME !!! Their is no assert_lnet_net_locked() to ensure this
+	 * actually called under the protection of the lnet_net_lock.
+	 */
 	memset(common, 0, sizeof(*common));
 
-	lnet_net_lock(LNET_LOCK_EX);
-
 	cfs_percpt_for_each(ctr, i, the_lnet.ln_counters) {
 		common->lcc_msgs_max	 += ctr->lct_common.lcc_msgs_max;
 		common->lcc_msgs_alloc   += ctr->lct_common.lcc_msgs_alloc;
@@ -876,23 +877,35 @@ static void lnet_assert_wire_constants(void)
 		common->lcc_route_length += ctr->lct_common.lcc_route_length;
 		common->lcc_drop_length  += ctr->lct_common.lcc_drop_length;
 	}
+}
+
+void
+lnet_counters_get_common(struct lnet_counters_common *common)
+{
+	lnet_net_lock(LNET_LOCK_EX);
+	lnet_counters_get_common_locked(common);
 	lnet_net_unlock(LNET_LOCK_EX);
 }
 EXPORT_SYMBOL(lnet_counters_get_common);
 
-void
+int
 lnet_counters_get(struct lnet_counters *counters)
 {
 	struct lnet_counters *ctr;
 	struct lnet_counters_health *health = &counters->lct_health;
-	int i;
+	int i, rc = 0;
 
 	memset(counters, 0, sizeof(*counters));
 
-	lnet_counters_get_common(&counters->lct_common);
-
 	lnet_net_lock(LNET_LOCK_EX);
 
+	if (the_lnet.ln_state != LNET_STATE_RUNNING) {
+		rc = -ENODEV;
+		goto out_unlock;
+	}
+
+	lnet_counters_get_common_locked(&counters->lct_common);
+
 	cfs_percpt_for_each(ctr, i, the_lnet.ln_counters) {
 		health->lch_rst_alloc += ctr->lct_health.lch_rst_alloc;
 		health->lch_resend_count += ctr->lct_health.lch_resend_count;
@@ -919,7 +932,9 @@ static void lnet_assert_wire_constants(void)
 		health->lch_network_timeout_count +=
 				ctr->lct_health.lch_network_timeout_count;
 	}
+out_unlock:
 	lnet_net_unlock(LNET_LOCK_EX);
+	return rc;
 }
 EXPORT_SYMBOL(lnet_counters_get);
 
@@ -931,9 +946,12 @@ static void lnet_assert_wire_constants(void)
 
 	lnet_net_lock(LNET_LOCK_EX);
 
+	if (the_lnet.ln_state != LNET_STATE_RUNNING)
+		goto avoid_reset;
+
 	cfs_percpt_for_each(counters, i, the_lnet.ln_counters)
 		memset(counters, 0, sizeof(struct lnet_counters));
-
+avoid_reset:
 	lnet_net_unlock(LNET_LOCK_EX);
 }
 
@@ -3680,9 +3698,9 @@ u32 lnet_get_dlc_seq_locked(void)
 			return -EINVAL;
 
 		mutex_lock(&the_lnet.ln_api_mutex);
-		lnet_counters_get(&lnet_stats->st_cntrs);
+		rc = lnet_counters_get(&lnet_stats->st_cntrs);
 		mutex_unlock(&the_lnet.ln_api_mutex);
-		return 0;
+		return rc;
 	}
 
 	case IOC_LIBCFS_CONFIG_RTR:
diff --git a/net/lnet/lnet/router_proc.c b/net/lnet/lnet/router_proc.c
index 7fe8d33..96cc506 100644
--- a/net/lnet/lnet/router_proc.c
+++ b/net/lnet/lnet/router_proc.c
@@ -83,8 +83,7 @@ static int proc_lnet_stats(struct ctl_table *table, int write,
 	size_t nob = *lenp;
 	loff_t pos = *ppos;
 	int len;
-	char *tmpstr;
-	const int tmpsiz = 256; /* 7 %u and 4 %llu */
+	char tmpstr[256]; /* 7 %u and 4 %llu */
 
 	if (write) {
 		lnet_counters_reset();
@@ -96,16 +95,13 @@ static int proc_lnet_stats(struct ctl_table *table, int write,
 	if (!ctrs)
 		return -ENOMEM;
 
-	tmpstr = kmalloc(tmpsiz, GFP_KERNEL);
-	if (!tmpstr) {
-		kfree(ctrs);
-		return -ENOMEM;
-	}
+	rc = lnet_counters_get(ctrs);
+	if (rc)
+		goto out_no_ctrs;
 
-	lnet_counters_get(ctrs);
 	common = ctrs->lct_common;
 
-	len = scnprintf(tmpstr, tmpsiz,
+	len = scnprintf(tmpstr, sizeof(tmpstr),
 			"%u %u %u %u %u %u %u %llu %llu %llu %llu",
 			common.lcc_msgs_alloc, common.lcc_msgs_max,
 			common.lcc_errors,
@@ -119,8 +115,7 @@ static int proc_lnet_stats(struct ctl_table *table, int write,
 	else
 		rc = cfs_trace_copyout_string(buffer, nob,
 					      tmpstr + pos, "\n");
-
-	kfree(tmpstr);
+out_no_ctrs:
 	kfree(ctrs);
 	return rc;
 }
-- 
1.8.3.1

  parent reply	other threads:[~2020-10-06  0:05 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-06  0:05 [lustre-devel] [PATCH 00/42] lustre: OpenSFS backport for Oct 4 2020 James Simmons
2020-10-06  0:05 ` [lustre-devel] [PATCH 01/42] lustre: ptlrpc: don't require CONFIG_CRYPTO_CRC32 James Simmons
2020-10-06  0:05 ` [lustre-devel] [PATCH 02/42] lustre: dom: lock cancel to drop pages James Simmons
2020-10-06  0:05 ` [lustre-devel] [PATCH 03/42] lustre: sec: use memchr_inv() to check if page is zero James Simmons
2020-10-06  0:05 ` [lustre-devel] [PATCH 04/42] lustre: mdc: fix lovea for replay James Simmons
2020-10-06  0:05 ` [lustre-devel] [PATCH 05/42] lustre: llite: add test to check client deadlock selinux James Simmons
2020-10-06  0:05 ` [lustre-devel] [PATCH 06/42] lnet: use init_wait(), not init_waitqueue_entry() James Simmons
2020-10-06  0:05 ` [lustre-devel] [PATCH 07/42] lustre: lov: make various lov_object.c function static James Simmons
2020-10-06  0:05 ` [lustre-devel] [PATCH 08/42] lustre: llite: return -ENODATA if no default layout James Simmons
2020-10-06  0:05 ` [lustre-devel] [PATCH 09/42] lnet: libcfs: don't save journal_info in dumplog thread James Simmons
2020-10-06  0:05 ` [lustre-devel] [PATCH 10/42] lustre: ldlm: lru code cleanup James Simmons
2020-10-06  0:05 ` [lustre-devel] [PATCH 11/42] lustre: ldlm: cancel LRU improvement James Simmons
2020-10-06  0:05 ` [lustre-devel] [PATCH 12/42] lnet: Do not set preferred NI for MR peer James Simmons
2020-10-06  0:05 ` [lustre-devel] [PATCH 13/42] lustre: ptlrpc: prefer crc32_le() over CryptoAPI James Simmons
2020-10-06  0:05 ` [lustre-devel] [PATCH 14/42] lnet: call event handlers without res_lock James Simmons
2020-10-06  0:05 ` [lustre-devel] [PATCH 15/42] lnet: Conditionally attach rspt in LNetPut & LNetGet James Simmons
2020-10-06  0:05 ` [lustre-devel] [PATCH 16/42] lustre: llite: reuse same cl_dio_aio for one IO James Simmons
2020-10-06  0:05 ` [lustre-devel] [PATCH 17/42] lustre: llite: move iov iter forward by ourself James Simmons
2020-10-06  0:05 ` [lustre-devel] [PATCH 18/42] lustre: llite: report client stats sumsq James Simmons
2020-10-06  0:05 ` [lustre-devel] [PATCH 19/42] lnet: Support checking for MD leaks James Simmons
2020-10-06  0:05 ` James Simmons [this message]
2020-10-06  0:06 ` [lustre-devel] [PATCH 21/42] lnet: Loosen restrictions on LNet Health params James Simmons
2020-10-06  0:06 ` [lustre-devel] [PATCH 22/42] lnet: Fix reference leak in lnet_select_pathway James Simmons
2020-10-06  0:06 ` [lustre-devel] [PATCH 23/42] lustre: llite: prune invalid dentries James Simmons
2020-10-06  0:06 ` [lustre-devel] [PATCH 24/42] lnet: Do not overwrite destination when routing James Simmons
2020-10-06  0:06 ` [lustre-devel] [PATCH 25/42] lustre: lov: don't use inline for operations functions James Simmons
2020-10-06  0:06 ` [lustre-devel] [PATCH 26/42] lustre: osc: don't allow negative grants James Simmons
2020-10-06  0:06 ` [lustre-devel] [PATCH 27/42] lustre: mgc: Use IR for client->MDS/OST connections James Simmons
2020-10-06  0:06 ` [lustre-devel] [PATCH 28/42] lustre: ldlm: don't use a locks without l_ast_data James Simmons
2020-10-06  0:06 ` [lustre-devel] [PATCH 29/42] lustre: lov: discard unused lov_dump_lmm* functions James Simmons
2020-10-06  0:06 ` [lustre-devel] [PATCH 30/42] lustre: lov: guard against class_exp2obd() returning NULL James Simmons
2020-10-06  0:06 ` [lustre-devel] [PATCH 31/42] lustre: clio: don't call aio_complete() in lustre upon errors James Simmons
2020-10-06  0:06 ` [lustre-devel] [PATCH 32/42] lustre: llite: it_lock_bits should be bit-wise tested James Simmons
2020-10-06  0:06 ` [lustre-devel] [PATCH 33/42] lustre: ldlm: control lru_size for extent lock James Simmons
2020-10-06  0:06 ` [lustre-devel] [PATCH 34/42] lustre: ldlm: pool fixes James Simmons
2020-10-06  0:06 ` [lustre-devel] [PATCH 35/42] lustre: ldlm: pool recalc forceful call James Simmons
2020-10-06  0:06 ` [lustre-devel] [PATCH 36/42] lustre: don't take spinlock to read a 'long' James Simmons
2020-10-06  0:06 ` [lustre-devel] [PATCH 37/42] lustre: osc: Do ELC on locks with no OSC object James Simmons
2020-10-06  0:06 ` [lustre-devel] [PATCH 38/42] lnet: deadlock on LNet shutdown James Simmons
2020-10-06  0:06 ` [lustre-devel] [PATCH 39/42] lustre: update version to 2.13.56 James Simmons
2020-10-06  0:06 ` [lustre-devel] [PATCH 40/42] lustre: llite: increase readahead default values James Simmons
2020-10-06  0:06 ` [lustre-devel] [PATCH 41/42] lustre: obdclass: don't initialize obj for zero FID James Simmons
2020-10-06  0:06 ` [lustre-devel] [PATCH 42/42] lustre: obdclass: fixes and improvements for jobid James Simmons

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1601942781-24950-21-git-send-email-jsimmons@infradead.org \
    --to=jsimmons@infradead.org \
    --cc=lustre-devel@lists.lustre.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.