From: Greg Kroah-Hartman <gregkh@linuxfoundation.org> To: linux-kernel@vger.kernel.org Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>, stable@vger.kernel.org, Theodore Tso <tytso@mit.edu>, Chris Mason <clm@fb.com>, Tejun Heo <tj@kernel.org>, Jens Axboe <axboe@kernel.dk>, Andrew Morton <akpm@linux-foundation.org>, Linus Torvalds <torvalds@linux-foundation.org>, Sasha Levin <sashal@kernel.org> Subject: [PATCH 4.4 10/35] memcg: fix a crash in wb_workfn when a device disappears Date: Mon, 22 Feb 2021 13:36:06 +0100 Message-ID: <20210222121018.902082040@linuxfoundation.org> (raw) In-Reply-To: <20210222121013.581198717@linuxfoundation.org> From: Theodore Ts'o <tytso@mit.edu> [ Upstream commit 68f23b89067fdf187763e75a56087550624fdbee ] Without memcg, there is a one-to-one mapping between the bdi and bdi_writeback structures. In this world, things are fairly straightforward; the first thing bdi_unregister() does is to shutdown the bdi_writeback structure (or wb), and part of that writeback ensures that no other work queued against the wb, and that the wb is fully drained. With memcg, however, there is a one-to-many relationship between the bdi and bdi_writeback structures; that is, there are multiple wb objects which can all point to a single bdi. There is a refcount which prevents the bdi object from being released (and hence, unregistered). So in theory, the bdi_unregister() *should* only get called once its refcount goes to zero (bdi_put will drop the refcount, and when it is zero, release_bdi gets called, which calls bdi_unregister). Unfortunately, del_gendisk() in block/gen_hd.c never got the memo about the Brave New memcg World, and calls bdi_unregister directly. It does this without informing the file system, or the memcg code, or anything else. This causes the root wb associated with the bdi to be unregistered, but none of the memcg-specific wb's are shutdown. So when one of these wb's are woken up to do delayed work, they try to dereference their wb->bdi->dev to fetch the device name, but unfortunately bdi->dev is now NULL, thanks to the bdi_unregister() called by del_gendisk(). As a result, *boom*. Fortunately, it looks like the rest of the writeback path is perfectly happy with bdi->dev and bdi->owner being NULL, so the simplest fix is to create a bdi_dev_name() function which can handle bdi->dev being NULL. This also allows us to bulletproof the writeback tracepoints to prevent them from dereferencing a NULL pointer and crashing the kernel if one is tracing with memcg's enabled, and an iSCSI device dies or a USB storage stick is pulled. The most common way of triggering this will be hotremoval of a device while writeback with memcg enabled is going on. It was triggering several times a day in a heavily loaded production environment. Google Bug Id: 145475544 Link: https://lore.kernel.org/r/20191227194829.150110-1-tytso@mit.edu Link: http://lkml.kernel.org/r/20191228005211.163952-1-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <clm@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org> --- fs/fs-writeback.c | 2 +- include/linux/backing-dev.h | 10 ++++++++++ include/trace/events/writeback.h | 29 +++++++++++++---------------- mm/backing-dev.c | 1 + 4 files changed, 25 insertions(+), 17 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 66a9c9dab8316..7f068330edb67 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1929,7 +1929,7 @@ void wb_workfn(struct work_struct *work) struct bdi_writeback, dwork); long pages_written; - set_worker_desc("flush-%s", dev_name(wb->bdi->dev)); + set_worker_desc("flush-%s", bdi_dev_name(wb->bdi)); current->flags |= PF_SWAPWRITE; if (likely(!current_is_workqueue_rescuer() || diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index 361274ce5815f..883ce03191e76 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -12,6 +12,7 @@ #include <linux/fs.h> #include <linux/sched.h> #include <linux/blkdev.h> +#include <linux/device.h> #include <linux/writeback.h> #include <linux/blk-cgroup.h> #include <linux/backing-dev-defs.h> @@ -518,4 +519,13 @@ static inline int bdi_rw_congested(struct backing_dev_info *bdi) (1 << WB_async_congested)); } +extern const char *bdi_unknown_name; + +static inline const char *bdi_dev_name(struct backing_dev_info *bdi) +{ + if (!bdi || !bdi->dev) + return bdi_unknown_name; + return dev_name(bdi->dev); +} + #endif /* _LINUX_BACKING_DEV_H */ diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h index 07a912af1dd74..d01217407d6d8 100644 --- a/include/trace/events/writeback.h +++ b/include/trace/events/writeback.h @@ -66,8 +66,8 @@ TRACE_EVENT(writeback_dirty_page, TP_fast_assign( strscpy_pad(__entry->name, - mapping ? dev_name(inode_to_bdi(mapping->host)->dev) : "(unknown)", - 32); + bdi_dev_name(mapping ? inode_to_bdi(mapping->host) : + NULL), 32); __entry->ino = mapping ? mapping->host->i_ino : 0; __entry->index = page->index; ), @@ -96,8 +96,7 @@ DECLARE_EVENT_CLASS(writeback_dirty_inode_template, struct backing_dev_info *bdi = inode_to_bdi(inode); /* may be called for files on pseudo FSes w/ unregistered bdi */ - strscpy_pad(__entry->name, - bdi->dev ? dev_name(bdi->dev) : "(unknown)", 32); + strscpy_pad(__entry->name, bdi_dev_name(bdi), 32); __entry->ino = inode->i_ino; __entry->state = inode->i_state; __entry->flags = flags; @@ -207,7 +206,7 @@ DECLARE_EVENT_CLASS(writeback_write_inode_template, TP_fast_assign( strscpy_pad(__entry->name, - dev_name(inode_to_bdi(inode)->dev), 32); + bdi_dev_name(inode_to_bdi(inode)), 32); __entry->ino = inode->i_ino; __entry->sync_mode = wbc->sync_mode; __trace_wbc_assign_cgroup(__get_str(cgroup), wbc); @@ -250,9 +249,7 @@ DECLARE_EVENT_CLASS(writeback_work_class, __dynamic_array(char, cgroup, __trace_wb_cgroup_size(wb)) ), TP_fast_assign( - strscpy_pad(__entry->name, - wb->bdi->dev ? dev_name(wb->bdi->dev) : - "(unknown)", 32); + strscpy_pad(__entry->name, bdi_dev_name(wb->bdi), 32); __entry->nr_pages = work->nr_pages; __entry->sb_dev = work->sb ? work->sb->s_dev : 0; __entry->sync_mode = work->sync_mode; @@ -305,7 +302,7 @@ DECLARE_EVENT_CLASS(writeback_class, __dynamic_array(char, cgroup, __trace_wb_cgroup_size(wb)) ), TP_fast_assign( - strscpy_pad(__entry->name, dev_name(wb->bdi->dev), 32); + strscpy_pad(__entry->name, bdi_dev_name(wb->bdi), 32); __trace_wb_assign_cgroup(__get_str(cgroup), wb); ), TP_printk("bdi %s: cgroup=%s", @@ -328,7 +325,7 @@ TRACE_EVENT(writeback_bdi_register, __array(char, name, 32) ), TP_fast_assign( - strscpy_pad(__entry->name, dev_name(bdi->dev), 32); + strscpy_pad(__entry->name, bdi_dev_name(bdi), 32); ), TP_printk("bdi %s", __entry->name @@ -353,7 +350,7 @@ DECLARE_EVENT_CLASS(wbc_class, ), TP_fast_assign( - strscpy_pad(__entry->name, dev_name(bdi->dev), 32); + strscpy_pad(__entry->name, bdi_dev_name(bdi), 32); __entry->nr_to_write = wbc->nr_to_write; __entry->pages_skipped = wbc->pages_skipped; __entry->sync_mode = wbc->sync_mode; @@ -404,7 +401,7 @@ TRACE_EVENT(writeback_queue_io, __dynamic_array(char, cgroup, __trace_wb_cgroup_size(wb)) ), TP_fast_assign( - strncpy_pad(__entry->name, dev_name(wb->bdi->dev), 32); + strscpy_pad(__entry->name, bdi_dev_name(wb->bdi), 32); __entry->older = dirtied_before; __entry->age = (jiffies - dirtied_before) * 1000 / HZ; __entry->moved = moved; @@ -489,7 +486,7 @@ TRACE_EVENT(bdi_dirty_ratelimit, ), TP_fast_assign( - strscpy_pad(__entry->bdi, dev_name(wb->bdi->dev), 32); + strscpy_pad(__entry->bdi, bdi_dev_name(wb->bdi), 32); __entry->write_bw = KBps(wb->write_bandwidth); __entry->avg_write_bw = KBps(wb->avg_write_bandwidth); __entry->dirty_rate = KBps(dirty_rate); @@ -554,7 +551,7 @@ TRACE_EVENT(balance_dirty_pages, TP_fast_assign( unsigned long freerun = (thresh + bg_thresh) / 2; - strscpy_pad(__entry->bdi, dev_name(wb->bdi->dev), 32); + strscpy_pad(__entry->bdi, bdi_dev_name(wb->bdi), 32); __entry->limit = global_wb_domain.dirty_limit; __entry->setpoint = (global_wb_domain.dirty_limit + @@ -616,7 +613,7 @@ TRACE_EVENT(writeback_sb_inodes_requeue, TP_fast_assign( strscpy_pad(__entry->name, - dev_name(inode_to_bdi(inode)->dev), 32); + bdi_dev_name(inode_to_bdi(inode)), 32); __entry->ino = inode->i_ino; __entry->state = inode->i_state; __entry->dirtied_when = inode->dirtied_when; @@ -690,7 +687,7 @@ DECLARE_EVENT_CLASS(writeback_single_inode_template, TP_fast_assign( strscpy_pad(__entry->name, - dev_name(inode_to_bdi(inode)->dev), 32); + bdi_dev_name(inode_to_bdi(inode)), 32); __entry->ino = inode->i_ino; __entry->state = inode->i_state; __entry->dirtied_when = inode->dirtied_when; diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 07e3b3b8e8469..f705c58b320b8 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -21,6 +21,7 @@ struct backing_dev_info noop_backing_dev_info = { EXPORT_SYMBOL_GPL(noop_backing_dev_info); static struct class *bdi_class; +const char *bdi_unknown_name = "(unknown)"; /* * bdi_lock protects updates to bdi_list. bdi_list has RCU reader side -- 2.27.0
next prev parent reply index Thread overview: 43+ messages / expand[flat|nested] mbox.gz Atom feed top 2021-02-22 12:35 [PATCH 4.4 00/35] 4.4.258-rc1 review Greg Kroah-Hartman 2021-02-22 12:35 ` [PATCH 4.4 01/35] tracing: Do not count ftrace events in top level enable output Greg Kroah-Hartman 2021-02-22 12:35 ` [PATCH 4.4 02/35] fgraph: Initialize tracing_graph_pause at task creation Greg Kroah-Hartman 2021-02-22 12:35 ` [PATCH 4.4 03/35] af_key: relax availability checks for skb size calculation Greg Kroah-Hartman 2021-02-22 12:36 ` [PATCH 4.4 04/35] iwlwifi: pcie: add a NULL check in iwl_pcie_txq_unmap Greg Kroah-Hartman 2021-02-25 6:04 ` Nobuhiro Iwamatsu 2021-02-25 8:14 ` Greg Kroah-Hartman 2021-02-25 8:47 ` Nobuhiro Iwamatsu 2021-02-22 12:36 ` [PATCH 4.4 05/35] iwlwifi: mvm: guard against device removal in reprobe Greg Kroah-Hartman 2021-02-22 12:36 ` [PATCH 4.4 06/35] SUNRPC: Move simple_get_bytes and simple_get_netobj into private header Greg Kroah-Hartman 2021-02-22 12:36 ` [PATCH 4.4 07/35] SUNRPC: Handle 0 length opaque XDR object data properly Greg Kroah-Hartman 2021-02-22 12:36 ` [PATCH 4.4 08/35] lib/string: Add strscpy_pad() function Greg Kroah-Hartman 2021-02-22 12:36 ` [PATCH 4.4 09/35] include/trace/events/writeback.h: fix -Wstringop-truncation warnings Greg Kroah-Hartman 2021-02-22 12:36 ` Greg Kroah-Hartman [this message] 2021-02-22 12:36 ` [PATCH 4.4 11/35] squashfs: add more sanity checks in id lookup Greg Kroah-Hartman 2021-02-22 12:36 ` [PATCH 4.4 12/35] squashfs: add more sanity checks in inode lookup Greg Kroah-Hartman 2021-02-22 12:36 ` [PATCH 4.4 13/35] squashfs: add more sanity checks in xattr id lookup Greg Kroah-Hartman 2021-02-22 12:36 ` [PATCH 4.4 14/35] memblock: do not start bottom-up allocations with kernel_end Greg Kroah-Hartman 2021-02-22 12:36 ` [PATCH 4.4 15/35] netfilter: xt_recent: Fix attempt to update deleted entry Greg Kroah-Hartman 2021-02-22 12:36 ` [PATCH 4.4 16/35] h8300: fix PREEMPTION build, TI_PRE_COUNT undefined Greg Kroah-Hartman 2021-02-22 12:36 ` [PATCH 4.4 17/35] usb: dwc3: ulpi: fix checkpatch warning Greg Kroah-Hartman 2021-02-22 12:36 ` [PATCH 4.4 18/35] usb: dwc3: ulpi: Replace CPU-based busyloop with Protocol-based one Greg Kroah-Hartman 2021-02-22 12:36 ` [PATCH 4.4 19/35] net: watchdog: hold device global xmit lock during tx disable Greg Kroah-Hartman 2021-02-22 12:36 ` [PATCH 4.4 20/35] vsock: fix locking in vsock_shutdown() Greg Kroah-Hartman 2021-02-22 12:36 ` [PATCH 4.4 21/35] x86/build: Disable CET instrumentation in the kernel for 32-bit too Greg Kroah-Hartman 2021-02-22 12:36 ` [PATCH 4.4 22/35] trace: Use -mcount-record for dynamic ftrace Greg Kroah-Hartman 2021-02-22 12:36 ` [PATCH 4.4 23/35] tracing: Fix SKIP_STACK_VALIDATION=1 build due to bad merge with -mrecord-mcount Greg Kroah-Hartman 2021-02-22 12:36 ` [PATCH 4.4 24/35] tracing: Avoid calling cc-option -mrecord-mcount for every Makefile Greg Kroah-Hartman 2021-02-22 12:36 ` [PATCH 4.4 25/35] Xen/x86: dont bail early from clear_foreign_p2m_mapping() Greg Kroah-Hartman 2021-02-22 12:36 ` [PATCH 4.4 26/35] Xen/x86: also check kernel mapping in set_foreign_p2m_mapping() Greg Kroah-Hartman 2021-02-22 12:36 ` [PATCH 4.4 27/35] Xen/gntdev: correct dev_bus_addr handling in gntdev_map_grant_pages() Greg Kroah-Hartman 2021-02-22 12:36 ` [PATCH 4.4 28/35] Xen/gntdev: correct error checking " Greg Kroah-Hartman 2021-02-22 12:36 ` [PATCH 4.4 29/35] xen/arm: dont ignore return errors from set_phys_to_machine Greg Kroah-Hartman 2021-02-22 12:36 ` [PATCH 4.4 30/35] xen-blkback: dont "handle" error by BUG() Greg Kroah-Hartman 2021-02-22 12:36 ` [PATCH 4.4 31/35] xen-netback: " Greg Kroah-Hartman 2021-02-22 12:36 ` [PATCH 4.4 32/35] xen-scsiback: " Greg Kroah-Hartman 2021-02-22 12:36 ` [PATCH 4.4 33/35] xen-blkback: fix error handling in xen_blkbk_map() Greg Kroah-Hartman 2021-02-22 12:36 ` [PATCH 4.4 34/35] scsi: qla2xxx: Fix crash during driver load on big endian machines Greg Kroah-Hartman 2021-02-22 12:36 ` [PATCH 4.4 35/35] kvm: check tlbs_dirty directly Greg Kroah-Hartman 2021-02-22 18:37 ` [PATCH 4.4 00/35] 4.4.258-rc1 review Pavel Machek 2021-02-22 21:26 ` Guenter Roeck 2021-02-23 12:13 ` Naresh Kamboju 2021-02-23 21:26 ` Shuah Khan
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20210222121018.902082040@linuxfoundation.org \ --to=gregkh@linuxfoundation.org \ --cc=akpm@linux-foundation.org \ --cc=axboe@kernel.dk \ --cc=clm@fb.com \ --cc=linux-kernel@vger.kernel.org \ --cc=sashal@kernel.org \ --cc=stable@vger.kernel.org \ --cc=tj@kernel.org \ --cc=torvalds@linux-foundation.org \ --cc=tytso@mit.edu \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
LKML Archive on lore.kernel.org Archives are clonable: git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git git clone --mirror https://lore.kernel.org/lkml/7 lkml/git/7.git git clone --mirror https://lore.kernel.org/lkml/8 lkml/git/8.git git clone --mirror https://lore.kernel.org/lkml/9 lkml/git/9.git # If you have public-inbox 1.1+ installed, you may # initialize and index your mirror using the following commands: public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \ linux-kernel@vger.kernel.org public-inbox-index lkml Example config snippet for mirrors Newsgroup available over NNTP: nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel AGPL code for this site: git clone https://public-inbox.org/public-inbox.git