kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* "statsfs" API design
@ 2019-11-06 15:56 Paolo Bonzini
  2019-11-09 15:49 ` Greg Kroah-Hartman
  0 siblings, 1 reply; 6+ messages in thread
From: Paolo Bonzini @ 2019-11-06 15:56 UTC (permalink / raw)
  To: KVM list
  Cc: Steven Rostedt, Christian Borntraeger, Alex Williamson,
	Peter Feiner, Greg Kroah-Hartman, linux-kernel, Linus Torvalds

Hi all,

statsfs is a proposal for a new Linux kernel synthetic filesystem, to be
mounted in /sys/kernel/stats, which exposes subsystem-level statistics
in sysfs.  Reading need not be particularly lightweight, but writing
must be fast.  Therefore, statistics are gathered at a fine-grain level
in order to avoid locking or atomic operations, and then aggregated by
statsfs until the desired granularity.

The first user of statsfs would be KVM, which is currently exposing its
stats in debugfs.  However, debugfs access is now limited by the
security lock down patches, and in addition statsfs aims to be a
more-or-less stable API, hence the idea of making it a separate
filesystem and mount point.

A few people have already expressed interest in this.  Christian
Borntraeger presented on the kvm_stat tool recently at KVM Forum and was
also thinking about using some high-level API in debugfs.  Google has
KVM patches to gather statistics in a binary format; it may be useful to
add this kind of functionality (and some kind of introspection similar
to what tracing does) to statsfs too in the future, but this is
independent from the kernel API.  I'm also CCing Alex Williamson, in
case VFIO is interested in something similar, and Steven Rostedt because
apparently he has enough free time to write poetry in addition to code.

There are just two concepts in statsfs, namely "values" (aka files) and
"sources" (directories).

A value represents a single quantity that is gathered by the statsfs
client.  It could be the number of vmexits of a given kind, the amount
of memory used by some data structure, the length of the longest hash
table chain, or anything like that.

Values are described by a struct like this one:

	struct statsfs_value {
		const char *name;
		enum stat_type type;	/* STAT_TYPE_{BOOL,U64,...} */
		u16 aggr_kind;		/* Bitmask with zero or more of
					 * STAT_AGGR_{MIN,MAX,SUM,...}
					 */
		u16 mode;		/* File mode */
		int offset;		/* Offset from base address
					 * to field containing the value
					 */
	};

As you can see, values are basically integers stored somewhere in a
struct.   The statsfs_value struct also includes information on which
operations (for example sum, min, max, average, count nonzero) it makes
sense to expose when the values are aggregated.

Sources form the bulk of the statsfs API.  They can include two kinds of
elements:

- values as described above.  The common case is to have many values
with the same base address, which are represented by an array of struct
statsfs_value

- subordinate sources

Adding a subordinate source has two effects:

- it creates a subdirectory for each subordinate source

- for each value in the subordinate sources which has aggr_kind != 0,
corresponding values will be created in the parent directory too.  If
multiple subordinate sources are backed by the same array of struct
statsfs_value, values from all those sources will be aggregated.  That
is, statsfs will compute these from the values of all items in the list
and show them in the parent directory.

Writable values can only be written with a value of zero. Writing zero
to an aggregate zeroes all the corresponding values in the subordinate
sources.

Sources are manipulated with these four functions:

	struct statsfs_source *statsfs_source_create(const char *fmt,
						     ...);
	void statsfs_source_add_values(struct statsfs_source *source,
				       struct statsfs_value *stat,
				       int n, void *ptr);
	void statsfs_source_add_subordinate(
					struct statsfs_source *source,
					struct statsfs_source *sub);
	void statsfs_source_remove_subordinate(
					struct statsfs_source *source,
					struct statsfs_source *sub);

Sources are reference counted, and for this reason there is also a pair
of functions in the usual style:

	void statsfs_source_get(struct statsfs_source *);
	void statsfs_source_put(struct statsfs_source *);

Finally,

	void statsfs_source_register(struct statsfs_source *source);

lets you create a toplevel statsfs directory.

As a practical example, KVM's usage of debugfs could be replaced by
something like this:

/* Globals */
	struct statsfs_value vcpu_stats[] = ...;
	struct statsfs_value vm_stats[] = ...;
	static struct statsfs_source *kvm_source;

/* On module creation */
	kvm_source = statsfs_source_create("kvm");
	statsfs_source_register(kvm_source);

/* On VM creation */
	kvm->src = statsfs_source_create("%d-%d\n",
				         task_pid_nr(current), fd);
	statsfs_source_add_values(kvm->src, vm_stats,
				  ARRAY_SIZE(vm_stats),
				  &kvm->stats);
	statsfs_source_add_subordinate(kvm_source, kvm->src);

/* On vCPU creation */
	vcpu_src = statsfs_source_create("vcpu%d\n", vcpu->vcpu_id);
	statsfs_source_add_values(vcpu_src, vcpu_stats,
				  ARRAY_SIZE(vcpu_stats),
				  &vcpu->stats);
	statsfs_source_add_subordinate(kvm->src, vcpu_src);
	/*
	 * No need to keep the vcpu_src around since there's no
	 * separate vCPU deletion event; rely on refcount
	 * exclusively.
	 */
	statsfs_source_put(vcpu_src);

/* On VM deletion */
	statsfs_source_remove_subordinate(kvm_source, kvm->src);
	statsfs_source_put(kvm->src);

/* On KVM exit */
	statsfs_source_put(kvm_source);

How does this look?

Paolo

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: "statsfs" API design
  2019-11-06 15:56 "statsfs" API design Paolo Bonzini
@ 2019-11-09 15:49 ` Greg Kroah-Hartman
  2019-11-10 13:04   ` Paolo Bonzini
  0 siblings, 1 reply; 6+ messages in thread
From: Greg Kroah-Hartman @ 2019-11-09 15:49 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: KVM list, Steven Rostedt, Christian Borntraeger, Alex Williamson,
	Peter Feiner, linux-kernel, Linus Torvalds

On Wed, Nov 06, 2019 at 04:56:25PM +0100, Paolo Bonzini wrote:
> Hi all,
> 
> statsfs is a proposal for a new Linux kernel synthetic filesystem, to be
> mounted in /sys/kernel/stats, which exposes subsystem-level statistics
> in sysfs.  Reading need not be particularly lightweight, but writing
> must be fast.  Therefore, statistics are gathered at a fine-grain level
> in order to avoid locking or atomic operations, and then aggregated by
> statsfs until the desired granularity.

Wait, reading a statistic from userspace can be slow, but writing to it
from userspace has to be fast?  Or do you mean the speed is all for
reading/writing the value within the kernel?

> The first user of statsfs would be KVM, which is currently exposing its
> stats in debugfs.  However, debugfs access is now limited by the
> security lock down patches, and in addition statsfs aims to be a
> more-or-less stable API, hence the idea of making it a separate
> filesystem and mount point.

Nice, I've had people ask about something like this for a while now.
For the most part they just dump stuff in sysfs instead (see the DRM
patches recently for people attempting to do that for debugfs values as
well.)

> A few people have already expressed interest in this.  Christian
> Borntraeger presented on the kvm_stat tool recently at KVM Forum and was
> also thinking about using some high-level API in debugfs.  Google has
> KVM patches to gather statistics in a binary format; it may be useful to
> add this kind of functionality (and some kind of introspection similar
> to what tracing does) to statsfs too in the future, but this is
> independent from the kernel API.  I'm also CCing Alex Williamson, in
> case VFIO is interested in something similar, and Steven Rostedt because
> apparently he has enough free time to write poetry in addition to code.
> 
> There are just two concepts in statsfs, namely "values" (aka files) and
> "sources" (directories).
> 
> A value represents a single quantity that is gathered by the statsfs
> client.  It could be the number of vmexits of a given kind, the amount
> of memory used by some data structure, the length of the longest hash
> table chain, or anything like that.
> 
> Values are described by a struct like this one:
> 
> 	struct statsfs_value {
> 		const char *name;
> 		enum stat_type type;	/* STAT_TYPE_{BOOL,U64,...} */
> 		u16 aggr_kind;		/* Bitmask with zero or more of
> 					 * STAT_AGGR_{MIN,MAX,SUM,...}
> 					 */
> 		u16 mode;		/* File mode */
> 		int offset;		/* Offset from base address
> 					 * to field containing the value
> 					 */
> 	};
> 
> As you can see, values are basically integers stored somewhere in a
> struct.   The statsfs_value struct also includes information on which
> operations (for example sum, min, max, average, count nonzero) it makes
> sense to expose when the values are aggregated.

What can userspace do with that info?

> Sources form the bulk of the statsfs API.  They can include two kinds of
> elements:
> 
> - values as described above.  The common case is to have many values
> with the same base address, which are represented by an array of struct
> statsfs_value
> 
> - subordinate sources
> 
> Adding a subordinate source has two effects:
> 
> - it creates a subdirectory for each subordinate source
> 
> - for each value in the subordinate sources which has aggr_kind != 0,
> corresponding values will be created in the parent directory too.  If
> multiple subordinate sources are backed by the same array of struct
> statsfs_value, values from all those sources will be aggregated.  That
> is, statsfs will compute these from the values of all items in the list
> and show them in the parent directory.
> 
> Writable values can only be written with a value of zero. Writing zero
> to an aggregate zeroes all the corresponding values in the subordinate
> sources.
> 
> Sources are manipulated with these four functions:
> 
> 	struct statsfs_source *statsfs_source_create(const char *fmt,
> 						     ...);
> 	void statsfs_source_add_values(struct statsfs_source *source,
> 				       struct statsfs_value *stat,
> 				       int n, void *ptr);
> 	void statsfs_source_add_subordinate(
> 					struct statsfs_source *source,
> 					struct statsfs_source *sub);
> 	void statsfs_source_remove_subordinate(
> 					struct statsfs_source *source,
> 					struct statsfs_source *sub);
> 
> Sources are reference counted, and for this reason there is also a pair
> of functions in the usual style:
> 
> 	void statsfs_source_get(struct statsfs_source *);
> 	void statsfs_source_put(struct statsfs_source *);
> 
> Finally,
> 
> 	void statsfs_source_register(struct statsfs_source *source);
> 
> lets you create a toplevel statsfs directory.
> 
> As a practical example, KVM's usage of debugfs could be replaced by
> something like this:
> 
> /* Globals */
> 	struct statsfs_value vcpu_stats[] = ...;
> 	struct statsfs_value vm_stats[] = ...;
> 	static struct statsfs_source *kvm_source;
> 
> /* On module creation */
> 	kvm_source = statsfs_source_create("kvm");
> 	statsfs_source_register(kvm_source);
> 
> /* On VM creation */
> 	kvm->src = statsfs_source_create("%d-%d\n",
> 				         task_pid_nr(current), fd);
> 	statsfs_source_add_values(kvm->src, vm_stats,
> 				  ARRAY_SIZE(vm_stats),
> 				  &kvm->stats);
> 	statsfs_source_add_subordinate(kvm_source, kvm->src);
> 
> /* On vCPU creation */
> 	vcpu_src = statsfs_source_create("vcpu%d\n", vcpu->vcpu_id);
> 	statsfs_source_add_values(vcpu_src, vcpu_stats,
> 				  ARRAY_SIZE(vcpu_stats),
> 				  &vcpu->stats);
> 	statsfs_source_add_subordinate(kvm->src, vcpu_src);
> 	/*
> 	 * No need to keep the vcpu_src around since there's no
> 	 * separate vCPU deletion event; rely on refcount
> 	 * exclusively.
> 	 */
> 	statsfs_source_put(vcpu_src);
> 
> /* On VM deletion */
> 	statsfs_source_remove_subordinate(kvm_source, kvm->src);
> 	statsfs_source_put(kvm->src);
> 
> /* On KVM exit */
> 	statsfs_source_put(kvm_source);
> 
> How does this look?

Where does the actual values get changed that get reflected in the
filesystem?

I have some old notes somewhere about what people really want when it
comes to a good "statistics" datatype, that I was thinking of building
off of, but that seems independant of what you are doing here, right?
This is just exporting existing values to userspace in a semi-sane way?

Anyway, I like the idea, but what about how this is exposed to
userspace?  The criticism of sysfs for statistics is that it is too slow
to open/read/close lots of files and tough to get "at this moment in
time these are all the different values" snapshots easily.  How will
this be addressed here?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: "statsfs" API design
  2019-11-09 15:49 ` Greg Kroah-Hartman
@ 2019-11-10 13:04   ` Paolo Bonzini
  2019-11-26 10:09     ` Greg Kroah-Hartman
  0 siblings, 1 reply; 6+ messages in thread
From: Paolo Bonzini @ 2019-11-10 13:04 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: KVM list, Steven Rostedt, Christian Borntraeger, Alex Williamson,
	Peter Feiner, linux-kernel, Linus Torvalds

On 09/11/19 16:49, Greg Kroah-Hartman wrote:
> On Wed, Nov 06, 2019 at 04:56:25PM +0100, Paolo Bonzini wrote:
>> Hi all,
>>
>> statsfs is a proposal for a new Linux kernel synthetic filesystem, to be
>> mounted in /sys/kernel/stats, which exposes subsystem-level statistics
>> in sysfs.  Reading need not be particularly lightweight, but writing
>> must be fast.  Therefore, statistics are gathered at a fine-grain level
>> in order to avoid locking or atomic operations, and then aggregated by
>> statsfs until the desired granularity.
> 
> Wait, reading a statistic from userspace can be slow, but writing to it
> from userspace has to be fast?  Or do you mean the speed is all for
> reading/writing the value within the kernel?

Reading/writing from the kernel.  Reads from a userspace are a superset
of reading from the kernel, writes from userspace are irrelevant.

[...]

>> As you can see, values are basically integers stored somewhere in a
>> struct.   The statsfs_value struct also includes information on which
>> operations (for example sum, min, max, average, count nonzero) it makes
>> sense to expose when the values are aggregated.
> 
> What can userspace do with that info?

The basic usage is logging.  A turbostat-like tool that is able to use
both debugfs and tracepoints is already in tools/kvm/kvm_stat.

> I have some old notes somewhere about what people really want when it
> comes to a good "statistics" datatype, that I was thinking of building
> off of, but that seems independant of what you are doing here, right?
> This is just exporting existing values to userspace in a semi-sane way?

For KVM yes.  But while I'm at it, I'd like the subsystem to be useful
for others so if you can dig out those notes I can integrate that.

> Anyway, I like the idea, but what about how this is exposed to
> userspace?  The criticism of sysfs for statistics is that it is too slow
> to open/read/close lots of files and tough to get "at this moment in
> time these are all the different values" snapshots easily.  How will
> this be addressed here?

Individual files in sysfs *should* be the first format to export
statsfs, since quick scripts are an important usecase.  However, another
advantage of having a higher-level API is that other ways to access the
stats can be added transparently.

The main requirement for that is self-descriptiveness, blindly passing
structs to userspace is certainly the worst format of all.  But I don't
like the idea of JSON or anything ASCII; that adds overhead to both
production and parsing, for no particular reason.   Tracepoints already
do something like that to export arguments, so perhaps there is room to
reuse code or at least some ideas.  It could be binary sysfs files
(again like tracing) or netlink, I haven't thought about it at all.

Paolo


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: "statsfs" API design
  2019-11-10 13:04   ` Paolo Bonzini
@ 2019-11-26 10:09     ` Greg Kroah-Hartman
  2019-11-26 10:50       ` Paolo Bonzini
  0 siblings, 1 reply; 6+ messages in thread
From: Greg Kroah-Hartman @ 2019-11-26 10:09 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: KVM list, Steven Rostedt, Christian Borntraeger, Alex Williamson,
	Peter Feiner, linux-kernel, Linus Torvalds

On Sun, Nov 10, 2019 at 02:04:58PM +0100, Paolo Bonzini wrote:
> On 09/11/19 16:49, Greg Kroah-Hartman wrote:
> > On Wed, Nov 06, 2019 at 04:56:25PM +0100, Paolo Bonzini wrote:
> >> Hi all,
> >>
> >> statsfs is a proposal for a new Linux kernel synthetic filesystem, to be
> >> mounted in /sys/kernel/stats, which exposes subsystem-level statistics
> >> in sysfs.  Reading need not be particularly lightweight, but writing
> >> must be fast.  Therefore, statistics are gathered at a fine-grain level
> >> in order to avoid locking or atomic operations, and then aggregated by
> >> statsfs until the desired granularity.
> > 
> > Wait, reading a statistic from userspace can be slow, but writing to it
> > from userspace has to be fast?  Or do you mean the speed is all for
> > reading/writing the value within the kernel?
> 
> Reading/writing from the kernel.  Reads from a userspace are a superset
> of reading from the kernel, writes from userspace are irrelevant.
> 
> [...]
> 
> >> As you can see, values are basically integers stored somewhere in a
> >> struct.   The statsfs_value struct also includes information on which
> >> operations (for example sum, min, max, average, count nonzero) it makes
> >> sense to expose when the values are aggregated.
> > 
> > What can userspace do with that info?
> 
> The basic usage is logging.  A turbostat-like tool that is able to use
> both debugfs and tracepoints is already in tools/kvm/kvm_stat.
> 
> > I have some old notes somewhere about what people really want when it
> > comes to a good "statistics" datatype, that I was thinking of building
> > off of, but that seems independant of what you are doing here, right?
> > This is just exporting existing values to userspace in a semi-sane way?
> 
> For KVM yes.  But while I'm at it, I'd like the subsystem to be useful
> for others so if you can dig out those notes I can integrate that.
> 
> > Anyway, I like the idea, but what about how this is exposed to
> > userspace?  The criticism of sysfs for statistics is that it is too slow
> > to open/read/close lots of files and tough to get "at this moment in
> > time these are all the different values" snapshots easily.  How will
> > this be addressed here?
> 
> Individual files in sysfs *should* be the first format to export
> statsfs, since quick scripts are an important usecase.  However, another
> advantage of having a higher-level API is that other ways to access the
> stats can be added transparently.
> 
> The main requirement for that is self-descriptiveness, blindly passing
> structs to userspace is certainly the worst format of all.  But I don't
> like the idea of JSON or anything ASCII; that adds overhead to both
> production and parsing, for no particular reason.   Tracepoints already
> do something like that to export arguments, so perhaps there is room to
> reuse code or at least some ideas.  It could be binary sysfs files
> (again like tracing) or netlink, I haven't thought about it at all.

So I think there are two different things here:
	- a simple data structure for in-kernel users of statistics
	- a way to export statistics to userspace

Now if they both can be solved with the same "solution", wonderful!  But
don't think that you have to do both of these at the same time.

Which one are you trying to solve here, I can't figure it out.  Is it
the second one?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: "statsfs" API design
  2019-11-26 10:09     ` Greg Kroah-Hartman
@ 2019-11-26 10:50       ` Paolo Bonzini
  2019-11-26 14:18         ` Greg Kroah-Hartman
  0 siblings, 1 reply; 6+ messages in thread
From: Paolo Bonzini @ 2019-11-26 10:50 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: KVM list, Steven Rostedt, Christian Borntraeger, Alex Williamson,
	Peter Feiner, linux-kernel, Linus Torvalds

On 26/11/19 11:09, Greg Kroah-Hartman wrote:
> So I think there are two different things here:
> 	- a simple data structure for in-kernel users of statistics
> 	- a way to export statistics to userspace
> 
> Now if they both can be solved with the same "solution", wonderful!  But
> don't think that you have to do both of these at the same time.
> 
> Which one are you trying to solve here, I can't figure it out.  Is it
> the second one?

I already have the second in KVM using debugfs, but that's not good.  So
I want to do:

- a simple data structure for in-kernel *publishers* of statistics

- a sysfs-based interface to export it to userspace, which looks a lot
like KVM's debugfs-based statistics.

What we don't have to do at the same time, is a new interface to
userspace, one that is more efficient while keeping the self-describing
property that we agree is needed.  That is also planned, but would come
later.

Paolo


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: "statsfs" API design
  2019-11-26 10:50       ` Paolo Bonzini
@ 2019-11-26 14:18         ` Greg Kroah-Hartman
  0 siblings, 0 replies; 6+ messages in thread
From: Greg Kroah-Hartman @ 2019-11-26 14:18 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: KVM list, Steven Rostedt, Christian Borntraeger, Alex Williamson,
	Peter Feiner, linux-kernel, Linus Torvalds

On Tue, Nov 26, 2019 at 11:50:29AM +0100, Paolo Bonzini wrote:
> On 26/11/19 11:09, Greg Kroah-Hartman wrote:
> > So I think there are two different things here:
> > 	- a simple data structure for in-kernel users of statistics
> > 	- a way to export statistics to userspace
> > 
> > Now if they both can be solved with the same "solution", wonderful!  But
> > don't think that you have to do both of these at the same time.
> > 
> > Which one are you trying to solve here, I can't figure it out.  Is it
> > the second one?
> 
> I already have the second in KVM using debugfs, but that's not good.  So
> I want to do:
> 
> - a simple data structure for in-kernel *publishers* of statistics
> 
> - a sysfs-based interface to export it to userspace, which looks a lot
> like KVM's debugfs-based statistics.
> 
> What we don't have to do at the same time, is a new interface to
> userspace, one that is more efficient while keeping the self-describing
> property that we agree is needed.  That is also planned, but would come
> later.

Ok, I have no objection to tying these to sysfs entries for now, but we
should be careful in how you handle the sysfs hierarchy to not make it
too complex or difficult to parse from userspace (lots of little files
does not make gathering stats easy as was already pointed out.)

for in-kernel stats, here's a note that I had that I finally found from
a different kernel developer saying what they wanted to see in something
like this:
	(Accurate) statistics counters need RMW ops, don't need memory
	ordering, usually can't be locked against writes, and may not
	care about wrapping.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2019-11-26 14:18 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-06 15:56 "statsfs" API design Paolo Bonzini
2019-11-09 15:49 ` Greg Kroah-Hartman
2019-11-10 13:04   ` Paolo Bonzini
2019-11-26 10:09     ` Greg Kroah-Hartman
2019-11-26 10:50       ` Paolo Bonzini
2019-11-26 14:18         ` Greg Kroah-Hartman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).