LKML Archive on lore.kernel.org
From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Will Deacon <will@kernel.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	kernel-team <kernel-team@android.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	K <prasad@linux.vnet.ibm.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Frederic Weisbecker <frederic@kernel.org>,
	Christoph Hellwig <hch@lst.de>,
	Quentin Perret <qperret@google.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Masami Hiramatsu <mhiramat@kernel.org>,
	rostedt <rostedt@goodmis.org>
Subject: Re: [PATCH 0/3] Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()
Date: Tue, 3 Mar 2020 11:58:47 -0500 (EST)
Message-ID: <34829202.16720.1583254727026.JavaMail.zimbra@efficios.com> (raw)
In-Reply-To: <20200303065735.GA1172591@kroah.com>

----- On Mar 3, 2020, at 1:57 AM, Greg Kroah-Hartman gregkh@linuxfoundation.org wrote:

> On Mon, Mar 02, 2020 at 03:17:07PM -0500, Mathieu Desnoyers wrote:
>> ----- On Mar 2, 2020, at 2:39 PM, Greg Kroah-Hartman gregkh@linuxfoundation.org
>> wrote:
>> 
>> > On Mon, Mar 02, 2020 at 08:36:58PM +0100, Greg Kroah-Hartman wrote:
[...]
>> >> 
>> >> I hate to ask, but why does anyone need block_class?  or disk_name or
>> >> disk_type?  I need to put them behind a driver core namespace or
>> >> something soon...
>> > 
>> 
>> In LTTng, we have a "statedump" which dumps all the relevant state of
>> the kernel at trace start (or when the user asks for it) into the
>> kernel trace. It is used to collect information which helps translating
>> compact numeric data into human-readable information at post-processing.
>> 
>> For block devices, the statedump contains the mapping between the
>> device number and the disk name [1]. It uses the "block_class",
>> "disk_name", and "disk_type" symbols for this. Here is the
>> post-processing output:
>> 
>> [14:48:41.388934812] (+?.?????????) compudjdev lttng_statedump_block_device: {
>> cpu_id = 0 }, { dev = 1048576, diskname = "ram0" }
>> [...]
>> [14:48:41.442548745] (+0.003574998) compudjdev lttng_statedump_block_device: {
>> cpu_id = 0 }, { dev = 1048591, diskname = "ram15" }
>> [14:48:41.446064977] (+0.003516232) compudjdev lttng_statedump_block_device: {
>> cpu_id = 0 }, { dev = 265289728, diskname = "vda" }
>> [14:48:41.449579781] (+0.003514804) compudjdev lttng_statedump_block_device: {
>> cpu_id = 0 }, { dev = 265289729, diskname = "vda1" }
>> [14:48:41.453113808] (+0.003534027) compudjdev lttng_statedump_block_device: {
>> cpu_id = 0 }, { dev = 265289744, diskname = "vdb" }
>> [14:48:41.456640876] (+0.003527068) compudjdev lttng_statedump_block_device: {
>> cpu_id = 0 }, { dev = 265289745, diskname = "vdb1" }
>> 
>> This information is then used in our I/O analyses to show information
>> comprehensible to a user.
> 
> But all of that is available to you today in userspace, why dig through
> random kernel symbols?
> 
> Look in /sys/dev/block/ or in /sys/block/ for all of that information.
> Is there something that you can only find by the internal symbols that
> is not present today in sysfs?

There is indeed additional information that we get by iterating directly
over the data structures and emitting tracepoints from within the
kernel, and which is lost when we copy the data to user-space: the
time-stamp at which the data is observed.

Please note that the statedump approach is applied not only to block
devices, but also to namespaces, thread scheduling, process memory
mappings, file descriptor tables, interrupt handlers, network
interfaces, and cpu topology. Those are more or less long-running states
which can change dynamically while a trace is being recorded. Our trace
post-processing tools model the overall kernel state over time by
reconciling tracepoints tracking all state changes (e.g.
insertions/removals) with the statedump information. The time-stamps
available for both state-change events and statedump events are what
allow us to do this modeling.
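
As a rough illustration of that reconciliation (a sketch, not LTTng's
actual post-processing code), the following Python merges statedump
events and state-change events into validity intervals. The event kinds
"statedump", "add" and "remove", the Interval type and the
build_state_model() helper are all hypothetical names chosen for this
example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interval:
    key: int             # e.g. a device number
    value: str           # e.g. a disk name
    start: int           # time-stamp at which the mapping was first observed
    end: Optional[int]   # None: still valid at the end of the trace


def build_state_model(events):
    """Build validity intervals from time-stamped events.

    events: iterable of (timestamp, kind, key, value) tuples, where kind is
    one of the hypothetical labels "statedump", "add" or "remove".
    """
    live = {}      # key -> currently open Interval
    closed = []    # intervals whose end time-stamp is known
    for ts, kind, key, value in sorted(events, key=lambda e: e[0]):
        if kind in ("statedump", "add"):
            cur = live.get(key)
            if cur is not None and cur.value == value:
                continue  # the statedump only confirms known state
            if cur is not None:
                cur.end = ts  # the mapping changed: close the old interval
                closed.append(cur)
            live[key] = Interval(key, value, ts, None)
        elif kind == "remove":
            cur = live.pop(key, None)
            if cur is not None:
                cur.end = ts
                closed.append(cur)
    return closed + list(live.values())
```

The sketch only works because every event, statedump or state change,
carries a time-stamp from the same clock; without that, the interval
boundaries could not be ordered against each other.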

In the case of block/genhd.c, we care about the mapping between the
device number and the disk name, which can change dynamically and is
thus valid only for a given time range in the trace.

What we are trying to do is gather all the relevant information needed
to allow trace analyses to re-create a precise model of the kernel
through time. In the case of genhd, that would be tuples mapping a
device number to a name, each valid for a given time range.

These are recorded by the "lttng_statedump_block_device" event, which
has a timestamp generated by the LTTng tracer, along with the
(device_nr, disk_name) event payload. You will notice from what I stated
above that we are missing information about disks being
registered/unregistered later in the trace. Ideally, we would also like
to add tracepoints into register_disk() and del_gendisk() (or more
relevant functions nearby) to trace those state changes. We have those
state-change tracking tracepoints in other subsystems, but unfortunately
not in the block layer.

This information is then used at post-processing to augment the block
events (see include/trace/events/block.h) by converting the device
numbers to human-readable strings. It is also used to provide
human-readable row, column and graph labels when presenting block I/O
state over time, and aggregated summaries, on a per-disk basis.
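
To make the conversion concrete, here is a small Python sketch (an
illustration only, not the code used by any LTTng analysis tool). It
assumes the "dev" values shown in the statedump output quoted above use
the kernel-internal dev_t layout from include/linux/kdev_t.h, with the
minor number in the low 20 bits. The DISK_NAMES table and the label()
helper are hypothetical.

```python
MINORBITS = 20                      # kernel-internal dev_t: minor in low 20 bits
MINORMASK = (1 << MINORBITS) - 1

# Mapping taken from the statedump sample quoted earlier in this thread.
DISK_NAMES = {
    1048576: "ram0",
    1048591: "ram15",
    265289728: "vda",
    265289729: "vda1",
    265289744: "vdb",
    265289745: "vdb1",
}


def major_minor(dev):
    """Split a kernel-internal device number into (major, minor)."""
    return dev >> MINORBITS, dev & MINORMASK


def label(dev, names=DISK_NAMES):
    """Return a human-readable label for a device number."""
    major, minor = major_minor(dev)
    return f"{names.get(dev, '?')} ({major}:{minor})"


print(label(265289729))  # vda1 (253:1)
print(label(1048591))    # ram15 (1:15)
```

A real analysis would pick the name from the mapping that is valid at
the time-stamp of each block event rather than from a single static
table.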

If we instead went through sysfs to export this block device
information, we would end up copying the relevant information to
user-space, and only then writing the data into a trace buffer. There
would be a delay between the point at which the state is observed within
the kernel and the sampling of the time-stamp describing when it was
observed, which introduces many interesting race scenarios where disk
registration/unregistration happens in parallel with a statedump and
with block device activity emitting tracepoints into the trace. We would
lose the ability to provide a reliable time range for which the
(device_nr, disk_name) mapping is valid. Going through uevent also lacks
time-stamp consistency with the block tracepoint events.
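
For comparison, a user-space snapshot along the lines suggested could
look like the Python sketch below. It assumes each entry under
/sys/dev/block is a "major:minor" symlink whose target directory is
named after the disk or partition; snapshot_block_devices() is a
hypothetical helper written for this example.

```python
import os
import time


def snapshot_block_devices(sysfs_root="/sys/dev/block"):
    """Return (timestamp_ns, {"major:minor": name}) sampled from user space."""
    mapping = {}
    for entry in sorted(os.listdir(sysfs_root)):
        # Each entry is expected to be a symlink to the device directory.
        target = os.path.realpath(os.path.join(sysfs_root, entry))
        mapping[entry] = os.path.basename(target)
    # The time-stamp is sampled only after the state has been copied out of
    # the kernel, so a disk may have been added or removed in between.
    return time.monotonic_ns(), mapping
```

Calling snapshot_block_devices() on a live system returns one
time-stamp for the whole directory walk, taken after the fact, which is
the delay described above.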

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Thread overview: 26+ messages
2020-02-21 11:44 Will Deacon
2020-02-21 11:44 ` [PATCH 1/3] samples/hw_breakpoint: Drop HW_BREAKPOINT_R when reporting writes Will Deacon
2020-02-21 14:12   ` Christoph Hellwig
2020-02-21 11:44 ` [PATCH 2/3] samples/hw_breakpoint: Drop use of kallsyms_lookup_name() Will Deacon
2020-02-21 14:13   ` Christoph Hellwig
2020-02-21 14:20     ` Will Deacon
2020-02-21 11:44 ` [PATCH 3/3] kallsyms: Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol() Will Deacon
2020-02-21 11:53   ` Greg Kroah-Hartman
2020-02-21 14:14   ` Christoph Hellwig
2020-02-21 15:11   ` Alexei Starovoitov
2020-02-21 14:27 ` [PATCH 0/3] " Masami Hiramatsu
2020-02-21 14:48   ` Will Deacon
2020-02-21 23:44     ` Masami Hiramatsu
2020-02-21 15:41 ` David Laight
2020-02-21 16:25   ` Quentin Perret
2020-02-25 10:05 ` Miroslav Benes
2020-02-25 12:11   ` Petr Mladek
2020-02-25 15:00     ` Joe Lawrence
2020-02-25 18:01       ` Miroslav Benes
2020-02-26 14:16         ` Joe Lawrence
2020-03-02 19:28 ` Mathieu Desnoyers
2020-03-02 19:36   ` Greg Kroah-Hartman
2020-03-02 19:39     ` Greg Kroah-Hartman
2020-03-02 20:17       ` Mathieu Desnoyers
2020-03-03  6:57         ` Greg Kroah-Hartman
2020-03-03 16:58           ` Mathieu Desnoyers [this message]
