linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: "statsfs" API design
@ 2019-11-09 18:44 Alexey Dobriyan
  2019-11-10  9:14 ` Greg KH
  0 siblings, 1 reply; 15+ messages in thread
From: Alexey Dobriyan @ 2019-11-09 18:44 UTC (permalink / raw)
  To: pbonzini, gregkh; +Cc: linux-kernel

> statsfs is a proposal for a new Linux kernel synthetic filesystem,
> to be mounted in /sys/kernel/stats

I think /proc experiment teaches pretty convincingly that dressing
things into a filesystem can be done but ultimately is a stupid idea.
It adds so much overhead for small-to-medium systems.

> The first user of statsfs would be KVM, which is currently exposing
> its stats in debugfs

> Google has KVM patches to gather statistics in a binary format

Which is a right thing to do.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "statsfs" API design
  2019-11-09 18:44 "statsfs" API design Alexey Dobriyan
@ 2019-11-10  9:14 ` Greg KH
  2019-11-10 10:09   ` Brian Masney
  2019-11-10 15:34   ` Alexey Dobriyan
  0 siblings, 2 replies; 15+ messages in thread
From: Greg KH @ 2019-11-10  9:14 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: pbonzini, linux-kernel

On Sat, Nov 09, 2019 at 09:44:41PM +0300, Alexey Dobriyan wrote:
> > statsfs is a proposal for a new Linux kernel synthetic filesystem,
> > to be mounted in /sys/kernel/stats
> 
> I think /proc experiment teaches pretty convincingly that dressing
> things into a filesystem can be done but ultimately is a stupid idea.
> It adds so much overhead for small-to-medium systems.
> 
> > The first user of statsfs would be KVM, which is currently exposing
> > its stats in debugfs
> 
> > Google has KVM patches to gather statistics in a binary format
> 
> Which is a right thing to do.

It's always "simpler" to just take binary data and suck it in.  That
works for a year or so until another value needs to be supported.  Or
removed.  Or features are backported.

The reason text values in individual files work is they are "self
describable" and "self discoverable".  You "know" what the value is and
that it is supported because the file is there or not.  With binary
values in a single file you do not know any of that.

So you need some way of describing the data to userspace in order for
this to work properly over the next 20+ years.

Maybe something like varlink which describes the data coming from the
kernel in an easy-to-handle format?  Or something else, but just using
blobs does not work over the long-term, sorry.

greg k-h

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "statsfs" API design
  2019-11-10  9:14 ` Greg KH
@ 2019-11-10 10:09   ` Brian Masney
  2019-11-10 10:14     ` Greg KH
  2019-11-10 15:34   ` Alexey Dobriyan
  1 sibling, 1 reply; 15+ messages in thread
From: Brian Masney @ 2019-11-10 10:09 UTC (permalink / raw)
  To: Greg KH; +Cc: Alexey Dobriyan, pbonzini, linux-kernel

On Sun, Nov 10, 2019 at 10:14:35AM +0100, Greg KH wrote:
> On Sat, Nov 09, 2019 at 09:44:41PM +0300, Alexey Dobriyan wrote:
> > > statsfs is a proposal for a new Linux kernel synthetic filesystem,
> > > to be mounted in /sys/kernel/stats
> > 
> > I think /proc experiment teaches pretty convincingly that dressing
> > things into a filesystem can be done but ultimately is a stupid idea.
> > It adds so much overhead for small-to-medium systems.
> > 
> > > The first user of statsfs would be KVM, which is currently exposing
> > > its stats in debugfs
> > 
> > > Google has KVM patches to gather statistics in a binary format
> > 
> > Which is a right thing to do.
> 
> It's always "simpler" to just take binary data and suck it in.  That
> works for a year or so until another value needs to be supported.  Or
> removed.  Or features are backported.
> 
> The reason text values in individual files work is they are "self
> describable" and "self discoverable".  You "know" what the value is and
> that it is supported because the file is there or not.  With binary
> values in a single file you do not know any of that.
> 
> So you need some way of describing the data to userspace in order for
> this to work properly over the next 20+ years.
> 
> Maybe something like varlink which describes the data coming from the
> kernel in an easy-to-handle format?  Or something else, but just using
> blobs does not work over the long-term, sorry.

What about using a text format like YAML? Here's some benefits:

  - The fields are self describing based on the key name.
  - New fields can be easily added without breaking compatibility.
  - Allows for a script to easily parse the contents while keeping
    human readability.
  - Would work for systems that run busybox as their userspace without
    having to install additional tools.
  - Allows for a nested data structure.

The downside is that the output would be larger than a binary interface
but it's more maintainable in my opinion.

Brian

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "statsfs" API design
  2019-11-10 10:09   ` Brian Masney
@ 2019-11-10 10:14     ` Greg KH
  2019-11-10 10:19       ` Greg KH
  0 siblings, 1 reply; 15+ messages in thread
From: Greg KH @ 2019-11-10 10:14 UTC (permalink / raw)
  To: Brian Masney; +Cc: Alexey Dobriyan, pbonzini, linux-kernel

On Sun, Nov 10, 2019 at 05:09:13AM -0500, Brian Masney wrote:
> On Sun, Nov 10, 2019 at 10:14:35AM +0100, Greg KH wrote:
> > On Sat, Nov 09, 2019 at 09:44:41PM +0300, Alexey Dobriyan wrote:
> > > > statsfs is a proposal for a new Linux kernel synthetic filesystem,
> > > > to be mounted in /sys/kernel/stats
> > > 
> > > I think /proc experiment teaches pretty convincingly that dressing
> > > things into a filesystem can be done but ultimately is a stupid idea.
> > > It adds so much overhead for small-to-medium systems.
> > > 
> > > > The first user of statsfs would be KVM, which is currently exposing
> > > > its stats in debugfs
> > > 
> > > > Google has KVM patches to gather statistics in a binary format
> > > 
> > > Which is a right thing to do.
> > 
> > It's always "simpler" to just take binary data and suck it in.  That
> > works for a year or so until another value needs to be supported.  Or
> > removed.  Or features are backported.
> > 
> > The reason text values in individual files work is they are "self
> > describable" and "self discoverable".  You "know" what the value is and
> > that it is supported because the file is there or not.  With binary
> > values in a single file you do not know any of that.
> > 
> > So you need some way of describing the data to userspace in order for
> > this to work properly over the next 20+ years.
> > 
> > Maybe something like varlink which describes the data coming from the
> > kernel in an easy-to-handle format?  Or something else, but just using
> > blobs does not work over the long-term, sorry.
> 
> What about using a text format like YAML? Here's some benefits:
> 
>   - The fields are self describing based on the key name.
>   - New fields can be easily added without breaking compatibility.
>   - Allows for a script to easily parse the contents while keeping
>     human readability.
>   - Would work for systems that run busybox as their userspace without
>     having to install additional tools.
>   - Allows for a nested data structure.

varlink was created to solve the issues that people have had with YAML
over time, so you might want to look into that :)
	https://varlink.org/

> The downside is that the output would be larger than a binary interface
> but it's more maintainable in my opinion.

binary interfaces are unmaintainable over time, especially when you do
not control both sides of the interface (unlike Google and their use of
this for KVM stats.)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "statsfs" API design
  2019-11-10 10:14     ` Greg KH
@ 2019-11-10 10:19       ` Greg KH
  0 siblings, 0 replies; 15+ messages in thread
From: Greg KH @ 2019-11-10 10:19 UTC (permalink / raw)
  To: Brian Masney; +Cc: Alexey Dobriyan, pbonzini, linux-kernel

On Sun, Nov 10, 2019 at 11:14:18AM +0100, Greg KH wrote:
> On Sun, Nov 10, 2019 at 05:09:13AM -0500, Brian Masney wrote:
> > On Sun, Nov 10, 2019 at 10:14:35AM +0100, Greg KH wrote:
> > > On Sat, Nov 09, 2019 at 09:44:41PM +0300, Alexey Dobriyan wrote:
> > > > > statsfs is a proposal for a new Linux kernel synthetic filesystem,
> > > > > to be mounted in /sys/kernel/stats
> > > > 
> > > > I think /proc experiment teaches pretty convincingly that dressing
> > > > things into a filesystem can be done but ultimately is a stupid idea.
> > > > It adds so much overhead for small-to-medium systems.
> > > > 
> > > > > The first user of statsfs would be KVM, which is currently exposing
> > > > > its stats in debugfs
> > > > 
> > > > > Google has KVM patches to gather statistics in a binary format
> > > > 
> > > > Which is a right thing to do.
> > > 
> > > It's always "simpler" to just take binary data and suck it in.  That
> > > works for a year or so until another value needs to be supported.  Or
> > > removed.  Or features are backported.
> > > 
> > > The reason text values in individual files work is they are "self
> > > describable" and "self discoverable".  You "know" what the value is and
> > > that it is supported because the file is there or not.  With binary
> > > values in a single file you do not know any of that.
> > > 
> > > So you need some way of describing the data to userspace in order for
> > > this to work properly over the next 20+ years.
> > > 
> > > Maybe something like varlink which describes the data coming from the
> > > kernel in an easy-to-handle format?  Or something else, but just using
> > > blobs does not work over the long-term, sorry.
> > 
> > What about using a text format like YAML? Here's some benefits:
> > 
> >   - The fields are self describing based on the key name.
> >   - New fields can be easily added without breaking compatibility.
> >   - Allows for a script to easily parse the contents while keeping
> >     human readability.
> >   - Would work for systems that run busybox as their userspace without
> >     having to install additional tools.
> >   - Allows for a nested data structure.
> 
> varlink was created to solve the issues that people have had with YAML
> over time, so you might want to look into that :)
> 	https://varlink.org/

Well, varlink really is JSON, but at least there's example code for
userspace and kernelspace to pull from to knock up tests for people who
want to look into it.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "statsfs" API design
  2019-11-10  9:14 ` Greg KH
  2019-11-10 10:09   ` Brian Masney
@ 2019-11-10 15:34   ` Alexey Dobriyan
  2019-11-10 20:58     ` Paolo Bonzini
  1 sibling, 1 reply; 15+ messages in thread
From: Alexey Dobriyan @ 2019-11-10 15:34 UTC (permalink / raw)
  To: Greg KH; +Cc: pbonzini, linux-kernel

On Sun, Nov 10, 2019 at 10:14:35AM +0100, Greg KH wrote:
> On Sat, Nov 09, 2019 at 09:44:41PM +0300, Alexey Dobriyan wrote:
> > > statsfs is a proposal for a new Linux kernel synthetic filesystem,
> > > to be mounted in /sys/kernel/stats
> > 
> > I think /proc experiment teaches pretty convincingly that dressing
> > things into a filesystem can be done but ultimately is a stupid idea.
> > It adds so much overhead for small-to-medium systems.
> > 
> > > The first user of statsfs would be KVM, which is currently exposing
> > > its stats in debugfs
> > 
> > > Google has KVM patches to gather statistics in a binary format
> > 
> > Which is a right thing to do.
> 
> It's always "simpler" to just take binary data and suck it in.

Faster too!

> That works for a year or so until another value needs to be supported.
> Or removed.  Or features are backported.
> 
> The reason text values in individual files work is they are "self
> describable" and "self discoverable".

Untrue. Applications always knows what the data means, by definition:

	"0x42"

What is this? 4-char NUL-terminated string? Or an integer 66? Or a
4-byte nonce blob for some kind of crypto algorithm.

In the other direction: describe every field of /proc/*/stat file
without looking to the manpage:

$ cat /proc/self/stat
5349 (cat) R 5342 5349 5342 34826 5349 4210688 91 0 0 0 0 0 0 0 20 0 1 0 864988 9183232 184 18446744073709551615 94352028622848 94352028651936 140733810522864 0 0 0 0 0 0 0 0 0 17 5 0 0 0 0 0 94352030751824 94352030753376 94352060055552 140733810527527 140733810527547 140733810527547 140733810532335 0

> You "know" what the value is and
> that it is supported because the file is there or not.  With binary
> values in a single file you do not know any of that.

You _always_ know that.

> So you need some way of describing the data to userspace in order for
> this to work properly over the next 20+ years.
> 
> Maybe something like varlink which describes the data coming from the
> kernel in an easy-to-handle format?  Or something else, but just using
> blobs does not work over the long-term, sorry.

Text doesn't work either. Why do you think /proc/*/maps have 1 space
character at the end of anon mappings? Because someone fucked up.
Here is how people use /proc in the field:

	https://stackoverflow.com/questions/3596781/how-to-detect-if-the-current-process-is-being-run-by-gdb

	open /proc/*/status
	read
	strstr("TracerPid:")

I'd humbly suggest to define the minimum amount of work for the task:
* some kind of percpu loop to gather stats
* some kind of accumulation code, possibly with min/max/avg
* write clear data
* copy_to_user

and realise that everything alse is a waste of electricity, namely,

* pathname allocation (4KB)
* VFS '/' split, lookups (/sys/kernel/.../" means 3+ lookups
* 192 bytes for each dentry
* 550+ bytes per inode
* 3 system calls per act of gathering statistics
	userspace will be written in the most stupid way possible
	without openat() etc
* userspace snprintf() for pathname
* kernel space snprintf() somewhere
* multiple copying inside kernel (vsnprintf.c)
* general inability for userspace to estimate the amount of data in decimal
  (nobody does that), so nicely sized buffers of 4K or 1K or 16KB (bash)
  will be used which is a waste.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "statsfs" API design
  2019-11-10 15:34   ` Alexey Dobriyan
@ 2019-11-10 20:58     ` Paolo Bonzini
  2019-11-11 20:40       ` Alexey Dobriyan
  0 siblings, 1 reply; 15+ messages in thread
From: Paolo Bonzini @ 2019-11-10 20:58 UTC (permalink / raw)
  To: Alexey Dobriyan, Greg KH; +Cc: linux-kernel

On 10/11/19 16:34, Alexey Dobriyan wrote:
> In the other direction: describe every field of /proc/*/stat file
> without looking to the manpage:
> 
> $ cat /proc/self/stat
> 5349 (cat) R 5342 5349 5342 34826 5349 4210688 91 0 0 0 0 0 0 0 20 0 1 0 864988 9183232 184 18446744073709551615 94352028622848 94352028651936 140733810522864 0 0 0 0 0 0 0 0 0 17 5 0 0 0 0 0 94352030751824 94352030753376 94352060055552 140733810527527 140733810527547 140733810527547 140733810532335 0

That's why this is not what I am proposing, and also not what Greg has
mentioned.

> and realise that everything alse is a waste of electricity, namely,
> 
> * pathname allocation (4KB)
> * VFS '/' split, lookups (/sys/kernel/.../" means 3+ lookups
> * 192 bytes for each dentry
> * 550+ bytes per inode
> * 3 system calls per act of gathering statistics
> 	userspace will be written in the most stupid way possible
> 	without openat() etc
> * userspace snprintf() for pathname
> * kernel space snprintf() somewhere
> * multiple copying inside kernel (vsnprintf.c)
> * general inability for userspace to estimate the amount of data in decimal
>   (nobody does that), so nicely sized buffers of 4K or 1K or 16KB (bash)
>   will be used which is a waste.

Yeah, all of this is true but I know how much I use
/sys/kernel/debug/kvm so backwards-compatibility with it is certainly a
requirement for stats.  Good thing, having a high-level stats API lets
you also design something that targets different usecases than just
quick "cat" or "watch".  The somewhat wasteful sysfs interface to
statsfs can even be hidden behind a kconfig symbol once there is an
alternative.  It also makes it possible to create inodes on demand if
someone is so inclined.

So the good thing is that despite the disagreements, this can be
considered an argument in favor of statsfs, and we agree on that. :)

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "statsfs" API design
  2019-11-10 20:58     ` Paolo Bonzini
@ 2019-11-11 20:40       ` Alexey Dobriyan
  2019-11-26 10:07         ` Greg KH
  0 siblings, 1 reply; 15+ messages in thread
From: Alexey Dobriyan @ 2019-11-11 20:40 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Greg KH, linux-kernel

On Sun, Nov 10, 2019 at 09:58:14PM +0100, Paolo Bonzini wrote:
> On 10/11/19 16:34, Alexey Dobriyan wrote:
> > In the other direction: describe every field of /proc/*/stat file
> > without looking to the manpage:
> > 
> > $ cat /proc/self/stat
> > 5349 (cat) R 5342 5349 5342 34826 5349 4210688 91 0 0 0 0 0 0 0 20 0 1 0 864988 9183232 184 18446744073709551615 94352028622848 94352028651936 140733810522864 0 0 0 0 0 0 0 0 0 17 5 0 0 0 0 0 94352030751824 94352030753376 94352060055552 140733810527527 140733810527547 140733810527547 140733810532335 0
> 
> That's why this is not what I am proposing, and also not what Greg has
> mentioned.

The argument was that text is somehow superior to binary. Experiment shows
that userspace can make a mess of both modes therefore preferring one
to another should be based on something else (preferably objective).

/proc have these two problems:
First, noticeably slow:

	https://news.ycombinator.com/item?id=21414882

Second, overinstantiating inodes and dentries:

	https://lore.kernel.org/lkml/20180424022106.16952-1-jeffm@suse.com/

statfs maybe never get to that level but it is not hard to see what lies
at the end of the tunnel.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "statsfs" API design
  2019-11-11 20:40       ` Alexey Dobriyan
@ 2019-11-26 10:07         ` Greg KH
  0 siblings, 0 replies; 15+ messages in thread
From: Greg KH @ 2019-11-26 10:07 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: Paolo Bonzini, linux-kernel

On Mon, Nov 11, 2019 at 11:40:32PM +0300, Alexey Dobriyan wrote:
> On Sun, Nov 10, 2019 at 09:58:14PM +0100, Paolo Bonzini wrote:
> > On 10/11/19 16:34, Alexey Dobriyan wrote:
> > > In the other direction: describe every field of /proc/*/stat file
> > > without looking to the manpage:
> > > 
> > > $ cat /proc/self/stat
> > > 5349 (cat) R 5342 5349 5342 34826 5349 4210688 91 0 0 0 0 0 0 0 20 0 1 0 864988 9183232 184 18446744073709551615 94352028622848 94352028651936 140733810522864 0 0 0 0 0 0 0 0 0 17 5 0 0 0 0 0 94352030751824 94352030753376 94352060055552 140733810527527 140733810527547 140733810527547 140733810532335 0
> > 
> > That's why this is not what I am proposing, and also not what Greg has
> > mentioned.
> 
> The argument was that text is somehow superior to binary. Experiment shows
> that userspace can make a mess of both modes therefore preferring one
> to another should be based on something else (preferably objective).

No, that was NOT what my argument was.

My argument is that you have to have self-describing data somehow,
otherwise you will always get out of sync with what the kernel is
exporting and what userspace expects.  The above crazy procfs example
proves my point very well :)

sysfs "solves" this problem by requiring one value per file.  If the
file is not present, userspace "knows" that the value isn't there at
all.  It a simple solution for the problem that procfs has with multiple
values in single files and is why we did it that way.

Now that's not to say this is the only way to solve the issue here, it's
just the one that we decided to use at the time.  statfs can choose to
do it differently, but it can NOT just ignore the problem here,
otherwise we end up with the old procfs problems as you show above.

> /proc have these two problems:
> First, noticeably slow:
> 
> 	https://news.ycombinator.com/item?id=21414882

Yes, opening thousands of files is "slow", that's known :)

> Second, overinstantiating inodes and dentries:
> 
> 	https://lore.kernel.org/lkml/20180424022106.16952-1-jeffm@suse.com/

Exporting hundreds of thousands of files, what could go wrong? :)

> statfs maybe never get to that level but it is not hard to see what lies
> at the end of the tunnel.

Those are nice things to think about while doing this, but I think we
are a long ways off from these types of issues.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "statsfs" API design
  2019-11-26 10:50       ` Paolo Bonzini
@ 2019-11-26 14:18         ` Greg Kroah-Hartman
  0 siblings, 0 replies; 15+ messages in thread
From: Greg Kroah-Hartman @ 2019-11-26 14:18 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: KVM list, Steven Rostedt, Christian Borntraeger, Alex Williamson,
	Peter Feiner, linux-kernel, Linus Torvalds

On Tue, Nov 26, 2019 at 11:50:29AM +0100, Paolo Bonzini wrote:
> On 26/11/19 11:09, Greg Kroah-Hartman wrote:
> > So I think there are two different things here:
> > 	- a simple data structure for in-kernel users of statistics
> > 	- a way to export statistics to userspace
> > 
> > Now if they both can be solved with the same "solution", wonderful!  But
> > don't think that you have to do both of these at the same time.
> > 
> > Which one are you trying to solve here, I can't figure it out.  Is it
> > the second one?
> 
> I already have the second in KVM using debugfs, but that's not good.  So
> I want to do:
> 
> - a simple data structure for in-kernel *publishers* of statistics
> 
> - a sysfs-based interface to export it to userspace, which looks a lot
> like KVM's debugfs-based statistics.
> 
> What we don't have to do at the same time, is a new interface to
> userspace, one that is more efficient while keeping the self-describing
> property that we agree is needed.  That is also planned, but would come
> later.

Ok, I have no objection to tying these to sysfs entries for now, but we
should be careful in how you handle the sysfs hierarchy to not make it
too complex or difficult to parse from userspace (lots of little files
does not make gathering stats easy as was already pointed out.)

for in-kernel stats, here's a note that I had that I finally found from
a different kernel developer saying what they wanted to see in something
like this:
	(Accurate) statistics counters need RMW ops, don't need memory
	ordering, usually can't be locked against writes, and may not
	care about wrapping.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "statsfs" API design
  2019-11-26 10:09     ` Greg Kroah-Hartman
@ 2019-11-26 10:50       ` Paolo Bonzini
  2019-11-26 14:18         ` Greg Kroah-Hartman
  0 siblings, 1 reply; 15+ messages in thread
From: Paolo Bonzini @ 2019-11-26 10:50 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: KVM list, Steven Rostedt, Christian Borntraeger, Alex Williamson,
	Peter Feiner, linux-kernel, Linus Torvalds

On 26/11/19 11:09, Greg Kroah-Hartman wrote:
> So I think there are two different things here:
> 	- a simple data structure for in-kernel users of statistics
> 	- a way to export statistics to userspace
> 
> Now if they both can be solved with the same "solution", wonderful!  But
> don't think that you have to do both of these at the same time.
> 
> Which one are you trying to solve here, I can't figure it out.  Is it
> the second one?

I already have the second in KVM using debugfs, but that's not good.  So
I want to do:

- a simple data structure for in-kernel *publishers* of statistics

- a sysfs-based interface to export it to userspace, which looks a lot
like KVM's debugfs-based statistics.

What we don't have to do at the same time, is a new interface to
userspace, one that is more efficient while keeping the self-describing
property that we agree is needed.  That is also planned, but would come
later.

Paolo


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "statsfs" API design
  2019-11-10 13:04   ` Paolo Bonzini
@ 2019-11-26 10:09     ` Greg Kroah-Hartman
  2019-11-26 10:50       ` Paolo Bonzini
  0 siblings, 1 reply; 15+ messages in thread
From: Greg Kroah-Hartman @ 2019-11-26 10:09 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: KVM list, Steven Rostedt, Christian Borntraeger, Alex Williamson,
	Peter Feiner, linux-kernel, Linus Torvalds

On Sun, Nov 10, 2019 at 02:04:58PM +0100, Paolo Bonzini wrote:
> On 09/11/19 16:49, Greg Kroah-Hartman wrote:
> > On Wed, Nov 06, 2019 at 04:56:25PM +0100, Paolo Bonzini wrote:
> >> Hi all,
> >>
> >> statsfs is a proposal for a new Linux kernel synthetic filesystem, to be
> >> mounted in /sys/kernel/stats, which exposes subsystem-level statistics
> >> in sysfs.  Reading need not be particularly lightweight, but writing
> >> must be fast.  Therefore, statistics are gathered at a fine-grain level
> >> in order to avoid locking or atomic operations, and then aggregated by
> >> statsfs until the desired granularity.
> > 
> > Wait, reading a statistic from userspace can be slow, but writing to it
> > from userspace has to be fast?  Or do you mean the speed is all for
> > reading/writing the value within the kernel?
> 
> Reading/writing from the kernel.  Reads from a userspace are a superset
> of reading from the kernel, writes from userspace are irrelevant.
> 
> [...]
> 
> >> As you can see, values are basically integers stored somewhere in a
> >> struct.   The statsfs_value struct also includes information on which
> >> operations (for example sum, min, max, average, count nonzero) it makes
> >> sense to expose when the values are aggregated.
> > 
> > What can userspace do with that info?
> 
> The basic usage is logging.  A turbostat-like tool that is able to use
> both debugfs and tracepoints is already in tools/kvm/kvm_stat.
> 
> > I have some old notes somewhere about what people really want when it
> > comes to a good "statistics" datatype, that I was thinking of building
> > off of, but that seems independant of what you are doing here, right?
> > This is just exporting existing values to userspace in a semi-sane way?
> 
> For KVM yes.  But while I'm at it, I'd like the subsystem to be useful
> for others so if you can dig out those notes I can integrate that.
> 
> > Anyway, I like the idea, but what about how this is exposed to
> > userspace?  The criticism of sysfs for statistics is that it is too slow
> > to open/read/close lots of files and tough to get "at this moment in
> > time these are all the different values" snapshots easily.  How will
> > this be addressed here?
> 
> Individual files in sysfs *should* be the first format to export
> statsfs, since quick scripts are an important usecase.  However, another
> advantage of having a higher-level API is that other ways to access the
> stats can be added transparently.
> 
> The main requirement for that is self-descriptiveness, blindly passing
> structs to userspace is certainly the worst format of all.  But I don't
> like the idea of JSON or anything ASCII; that adds overhead to both
> production and parsing, for no particular reason.   Tracepoints already
> do something like that to export arguments, so perhaps there is room to
> reuse code or at least some ideas.  It could be binary sysfs files
> (again like tracing) or netlink, I haven't thought about it at all.

So I think there are two different things here:
	- a simple data structure for in-kernel users of statistics
	- a way to export statistics to userspace

Now if they both can be solved with the same "solution", wonderful!  But
don't think that you have to do both of these at the same time.

Which one are you trying to solve here, I can't figure it out.  Is it
the second one?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "statsfs" API design
  2019-11-09 15:49 ` Greg Kroah-Hartman
@ 2019-11-10 13:04   ` Paolo Bonzini
  2019-11-26 10:09     ` Greg Kroah-Hartman
  0 siblings, 1 reply; 15+ messages in thread
From: Paolo Bonzini @ 2019-11-10 13:04 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: KVM list, Steven Rostedt, Christian Borntraeger, Alex Williamson,
	Peter Feiner, linux-kernel, Linus Torvalds

On 09/11/19 16:49, Greg Kroah-Hartman wrote:
> On Wed, Nov 06, 2019 at 04:56:25PM +0100, Paolo Bonzini wrote:
>> Hi all,
>>
>> statsfs is a proposal for a new Linux kernel synthetic filesystem, to be
>> mounted in /sys/kernel/stats, which exposes subsystem-level statistics
>> in sysfs.  Reading need not be particularly lightweight, but writing
>> must be fast.  Therefore, statistics are gathered at a fine-grain level
>> in order to avoid locking or atomic operations, and then aggregated by
>> statsfs until the desired granularity.
> 
> Wait, reading a statistic from userspace can be slow, but writing to it
> from userspace has to be fast?  Or do you mean the speed is all for
> reading/writing the value within the kernel?

Reading/writing from the kernel.  Reads from a userspace are a superset
of reading from the kernel, writes from userspace are irrelevant.

[...]

>> As you can see, values are basically integers stored somewhere in a
>> struct.   The statsfs_value struct also includes information on which
>> operations (for example sum, min, max, average, count nonzero) it makes
>> sense to expose when the values are aggregated.
> 
> What can userspace do with that info?

The basic usage is logging.  A turbostat-like tool that is able to use
both debugfs and tracepoints is already in tools/kvm/kvm_stat.

> I have some old notes somewhere about what people really want when it
> comes to a good "statistics" datatype, that I was thinking of building
> off of, but that seems independant of what you are doing here, right?
> This is just exporting existing values to userspace in a semi-sane way?

For KVM yes.  But while I'm at it, I'd like the subsystem to be useful
for others so if you can dig out those notes I can integrate that.

> Anyway, I like the idea, but what about how this is exposed to
> userspace?  The criticism of sysfs for statistics is that it is too slow
> to open/read/close lots of files and tough to get "at this moment in
> time these are all the different values" snapshots easily.  How will
> this be addressed here?

Individual files in sysfs *should* be the first format to export
statsfs, since quick scripts are an important usecase.  However, another
advantage of having a higher-level API is that other ways to access the
stats can be added transparently.

The main requirement for that is self-descriptiveness, blindly passing
structs to userspace is certainly the worst format of all.  But I don't
like the idea of JSON or anything ASCII; that adds overhead to both
production and parsing, for no particular reason.   Tracepoints already
do something like that to export arguments, so perhaps there is room to
reuse code or at least some ideas.  It could be binary sysfs files
(again like tracing) or netlink, I haven't thought about it at all.

Paolo


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "statsfs" API design
  2019-11-06 15:56 Paolo Bonzini
@ 2019-11-09 15:49 ` Greg Kroah-Hartman
  2019-11-10 13:04   ` Paolo Bonzini
  0 siblings, 1 reply; 15+ messages in thread
From: Greg Kroah-Hartman @ 2019-11-09 15:49 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: KVM list, Steven Rostedt, Christian Borntraeger, Alex Williamson,
	Peter Feiner, linux-kernel, Linus Torvalds

On Wed, Nov 06, 2019 at 04:56:25PM +0100, Paolo Bonzini wrote:
> Hi all,
> 
> statsfs is a proposal for a new Linux kernel synthetic filesystem, to be
> mounted in /sys/kernel/stats, which exposes subsystem-level statistics
> in sysfs.  Reading need not be particularly lightweight, but writing
> must be fast.  Therefore, statistics are gathered at a fine-grain level
> in order to avoid locking or atomic operations, and then aggregated by
> statsfs until the desired granularity.

Wait, reading a statistic from userspace can be slow, but writing to it
from userspace has to be fast?  Or do you mean the speed is all for
reading/writing the value within the kernel?

> The first user of statsfs would be KVM, which is currently exposing its
> stats in debugfs.  However, debugfs access is now limited by the
> security lock down patches, and in addition statsfs aims to be a
> more-or-less stable API, hence the idea of making it a separate
> filesystem and mount point.

Nice, I've had people ask about something like this for a while now.
For the most part they just dump stuff in sysfs instead (see the DRM
patches recently for people attempting to do that for debugfs values as
well.)

> A few people have already expressed interest in this.  Christian
> Borntraeger presented on the kvm_stat tool recently at KVM Forum and was
> also thinking about using some high-level API in debugfs.  Google has
> KVM patches to gather statistics in a binary format; it may be useful to
> add this kind of functionality (and some kind of introspection similar
> to what tracing does) to statsfs too in the future, but this is
> independent from the kernel API.  I'm also CCing Alex Williamson, in
> case VFIO is interested in something similar, and Steven Rostedt because
> apparently he has enough free time to write poetry in addition to code.
> 
> There are just two concepts in statsfs, namely "values" (aka files) and
> "sources" (directories).
> 
> A value represents a single quantity that is gathered by the statsfs
> client.  It could be the number of vmexits of a given kind, the amount
> of memory used by some data structure, the length of the longest hash
> table chain, or anything like that.
> 
> Values are described by a struct like this one:
> 
> 	struct statsfs_value {
> 		const char *name;
> 		enum stat_type type;	/* STAT_TYPE_{BOOL,U64,...} */
> 		u16 aggr_kind;		/* Bitmask with zero or more of
> 					 * STAT_AGGR_{MIN,MAX,SUM,...}
> 					 */
> 		u16 mode;		/* File mode */
> 		int offset;		/* Offset from base address
> 					 * to field containing the value
> 					 */
> 	};
> 
> As you can see, values are basically integers stored somewhere in a
> struct.   The statsfs_value struct also includes information on which
> operations (for example sum, min, max, average, count nonzero) it makes
> sense to expose when the values are aggregated.

What can userspace do with that info?

> Sources form the bulk of the statsfs API.  They can include two kinds of
> elements:
> 
> - values as described above.  The common case is to have many values
> with the same base address, which are represented by an array of struct
> statsfs_value
> 
> - subordinate sources
> 
> Adding a subordinate source has two effects:
> 
> - it creates a subdirectory for each subordinate source
> 
> - for each value in the subordinate sources which has aggr_kind != 0,
> corresponding values will be created in the parent directory too.  If
> multiple subordinate sources are backed by the same array of struct
> statsfs_value, values from all those sources will be aggregated.  That
> is, statsfs will compute these from the values of all items in the list
> and show them in the parent directory.
> 
> Writable values can only be written with a value of zero. Writing zero
> to an aggregate zeroes all the corresponding values in the subordinate
> sources.
> 
> Sources are manipulated with these four functions:
> 
> 	struct statsfs_source *statsfs_source_create(const char *fmt,
> 						     ...);
> 	void statsfs_source_add_values(struct statsfs_source *source,
> 				       struct statsfs_value *stat,
> 				       int n, void *ptr);
> 	void statsfs_source_add_subordinate(
> 					struct statsfs_source *source,
> 					struct statsfs_source *sub);
> 	void statsfs_source_remove_subordinate(
> 					struct statsfs_source *source,
> 					struct statsfs_source *sub);
> 
> Sources are reference counted, and for this reason there is also a pair
> of functions in the usual style:
> 
> 	void statsfs_source_get(struct statsfs_source *);
> 	void statsfs_source_put(struct statsfs_source *);
> 
> Finally,
> 
> 	void statsfs_source_register(struct statsfs_source *source);
> 
> lets you create a toplevel statsfs directory.
> 
> As a practical example, KVM's usage of debugfs could be replaced by
> something like this:
> 
> /* Globals */
> 	struct statsfs_value vcpu_stats[] = ...;
> 	struct statsfs_value vm_stats[] = ...;
> 	static struct statsfs_source *kvm_source;
> 
> /* On module creation */
> 	kvm_source = statsfs_source_create("kvm");
> 	statsfs_source_register(kvm_source);
> 
> /* On VM creation */
> 	kvm->src = statsfs_source_create("%d-%d\n",
> 				         task_pid_nr(current), fd);
> 	statsfs_source_add_values(kvm->src, vm_stats,
> 				  ARRAY_SIZE(vm_stats),
> 				  &kvm->stats);
> 	statsfs_source_add_subordinate(kvm_source, kvm->src);
> 
> /* On vCPU creation */
> 	vcpu_src = statsfs_source_create("vcpu%d\n", vcpu->vcpu_id);
> 	statsfs_source_add_values(vcpu_src, vcpu_stats,
> 				  ARRAY_SIZE(vcpu_stats),
> 				  &vcpu->stats);
> 	statsfs_source_add_subordinate(kvm->src, vcpu_src);
> 	/*
> 	 * No need to keep the vcpu_src around since there's no
> 	 * separate vCPU deletion event; rely on refcount
> 	 * exclusively.
> 	 */
> 	statsfs_source_put(vcpu_src);
> 
> /* On VM deletion */
> 	statsfs_source_remove_subordinate(kvm_source, kvm->src);
> 	statsfs_source_put(kvm->src);
> 
> /* On KVM exit */
> 	statsfs_source_put(kvm_source);
> 
> How does this look?

Where does the actual values get changed that get reflected in the
filesystem?

I have some old notes somewhere about what people really want when it
comes to a good "statistics" datatype, that I was thinking of building
off of, but that seems independant of what you are doing here, right?
This is just exporting existing values to userspace in a semi-sane way?

Anyway, I like the idea, but what about how this is exposed to
userspace?  The criticism of sysfs for statistics is that it is too slow
to open/read/close lots of files and tough to get "at this moment in
time these are all the different values" snapshots easily.  How will
this be addressed here?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 15+ messages in thread

* "statsfs" API design
@ 2019-11-06 15:56 Paolo Bonzini
  2019-11-09 15:49 ` Greg Kroah-Hartman
  0 siblings, 1 reply; 15+ messages in thread
From: Paolo Bonzini @ 2019-11-06 15:56 UTC (permalink / raw)
  To: KVM list
  Cc: Steven Rostedt, Christian Borntraeger, Alex Williamson,
	Peter Feiner, Greg Kroah-Hartman, linux-kernel, Linus Torvalds

Hi all,

statsfs is a proposal for a new Linux kernel synthetic filesystem, to be
mounted in /sys/kernel/stats, which exposes subsystem-level statistics
in sysfs.  Reading need not be particularly lightweight, but writing
must be fast.  Therefore, statistics are gathered at a fine-grain level
in order to avoid locking or atomic operations, and then aggregated by
statsfs until the desired granularity.

The first user of statsfs would be KVM, which is currently exposing its
stats in debugfs.  However, debugfs access is now limited by the
security lock down patches, and in addition statsfs aims to be a
more-or-less stable API, hence the idea of making it a separate
filesystem and mount point.

A few people have already expressed interest in this.  Christian
Borntraeger presented on the kvm_stat tool recently at KVM Forum and was
also thinking about using some high-level API in debugfs.  Google has
KVM patches to gather statistics in a binary format; it may be useful to
add this kind of functionality (and some kind of introspection similar
to what tracing does) to statsfs too in the future, but this is
independent from the kernel API.  I'm also CCing Alex Williamson, in
case VFIO is interested in something similar, and Steven Rostedt because
apparently he has enough free time to write poetry in addition to code.

There are just two concepts in statsfs, namely "values" (aka files) and
"sources" (directories).

A value represents a single quantity that is gathered by the statsfs
client.  It could be the number of vmexits of a given kind, the amount
of memory used by some data structure, the length of the longest hash
table chain, or anything like that.

Values are described by a struct like this one:

	struct statsfs_value {
		const char *name;
		enum stat_type type;	/* STAT_TYPE_{BOOL,U64,...} */
		u16 aggr_kind;		/* Bitmask with zero or more of
					 * STAT_AGGR_{MIN,MAX,SUM,...}
					 */
		u16 mode;		/* File mode */
		int offset;		/* Offset from base address
					 * to field containing the value
					 */
	};

As you can see, values are basically integers stored somewhere in a
struct.   The statsfs_value struct also includes information on which
operations (for example sum, min, max, average, count nonzero) it makes
sense to expose when the values are aggregated.

Sources form the bulk of the statsfs API.  They can include two kinds of
elements:

- values as described above.  The common case is to have many values
with the same base address, which are represented by an array of struct
statsfs_value

- subordinate sources

Adding a subordinate source has two effects:

- it creates a subdirectory for each subordinate source

- for each value in the subordinate sources which has aggr_kind != 0,
corresponding values will be created in the parent directory too.  If
multiple subordinate sources are backed by the same array of struct
statsfs_value, values from all those sources will be aggregated.  That
is, statsfs will compute these from the values of all items in the list
and show them in the parent directory.

Writable values can only be written with a value of zero. Writing zero
to an aggregate zeroes all the corresponding values in the subordinate
sources.

Sources are manipulated with these four functions:

	struct statsfs_source *statsfs_source_create(const char *fmt,
						     ...);
	void statsfs_source_add_values(struct statsfs_source *source,
				       struct statsfs_value *stat,
				       int n, void *ptr);
	void statsfs_source_add_subordinate(
					struct statsfs_source *source,
					struct statsfs_source *sub);
	void statsfs_source_remove_subordinate(
					struct statsfs_source *source,
					struct statsfs_source *sub);

Sources are reference counted, and for this reason there is also a pair
of functions in the usual style:

	void statsfs_source_get(struct statsfs_source *);
	void statsfs_source_put(struct statsfs_source *);

Finally,

	void statsfs_source_register(struct statsfs_source *source);

lets you create a toplevel statsfs directory.

As a practical example, KVM's usage of debugfs could be replaced by
something like this:

/* Globals */
	struct statsfs_value vcpu_stats[] = ...;
	struct statsfs_value vm_stats[] = ...;
	static struct statsfs_source *kvm_source;

/* On module creation */
	kvm_source = statsfs_source_create("kvm");
	statsfs_source_register(kvm_source);

/* On VM creation */
	kvm->src = statsfs_source_create("%d-%d\n",
				         task_pid_nr(current), fd);
	statsfs_source_add_values(kvm->src, vm_stats,
				  ARRAY_SIZE(vm_stats),
				  &kvm->stats);
	statsfs_source_add_subordinate(kvm_source, kvm->src);

/* On vCPU creation */
	vcpu_src = statsfs_source_create("vcpu%d\n", vcpu->vcpu_id);
	statsfs_source_add_values(vcpu_src, vcpu_stats,
				  ARRAY_SIZE(vcpu_stats),
				  &vcpu->stats);
	statsfs_source_add_subordinate(kvm->src, vcpu_src);
	/*
	 * No need to keep the vcpu_src around since there's no
	 * separate vCPU deletion event; rely on refcount
	 * exclusively.
	 */
	statsfs_source_put(vcpu_src);

/* On VM deletion */
	statsfs_source_remove_subordinate(kvm_source, kvm->src);
	statsfs_source_put(kvm->src);

/* On KVM exit */
	statsfs_source_put(kvm_source);

How does this look?

Paolo

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2019-11-26 14:18 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-09 18:44 "statsfs" API design Alexey Dobriyan
2019-11-10  9:14 ` Greg KH
2019-11-10 10:09   ` Brian Masney
2019-11-10 10:14     ` Greg KH
2019-11-10 10:19       ` Greg KH
2019-11-10 15:34   ` Alexey Dobriyan
2019-11-10 20:58     ` Paolo Bonzini
2019-11-11 20:40       ` Alexey Dobriyan
2019-11-26 10:07         ` Greg KH
  -- strict thread matches above, loose matches on Subject: below --
2019-11-06 15:56 Paolo Bonzini
2019-11-09 15:49 ` Greg Kroah-Hartman
2019-11-10 13:04   ` Paolo Bonzini
2019-11-26 10:09     ` Greg Kroah-Hartman
2019-11-26 10:50       ` Paolo Bonzini
2019-11-26 14:18         ` Greg Kroah-Hartman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).