From mboxrd@z Thu Jan  1 00:00:00 1970
Subject: Re: [proposal] making filesystem tools more machine friendly
To: Jan Tulak <jtulak@redhat.com>, linux-fsdevel@vger.kernel.org
References: <CACj3i70p=ybXZWRncseqwnbs7HAYu-SL02+cj--7T5YnqcwVKQ@mail.gmail.com>
From: Andrew Price <anprice@redhat.com>
Message-ID: <095aeded-39d1-c331-22cc-cdf1da069e3f@redhat.com>
Date: Mon, 27 Nov 2017 14:57:52 +0000
MIME-Version: 1.0
In-Reply-To: <CACj3i70p=ybXZWRncseqwnbs7HAYu-SL02+cj--7T5YnqcwVKQ@mail.gmail.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 8bit
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

Hi,

On 30/06/17 09:17, Jan Tulak wrote:
> AKA filesystem API
> 
> Hi guys
> 
> Currently, filesystem tools are not made with automation in mind. So
> any tool that wants to interact with filesystems (be it for
> automation, or to provide a more user-friendly interface) has to
> screen scrape everything and cope with changing outputs.
> 
> I think it is time to give some thought to how to make the fs
> tools easier to use from scripts and other tools.

What's the status of this? I'd like to make sure gfs2-utils is geared up 
for it and catered for, whatever solution is chosen.

Perhaps this ship has already sailed, but I think a JSON dependency may 
be a little too heavy. Perhaps a simpler stream of key-value lists that 
can be generated with fprintf() would suffice? For accepting options, we 
already have code to parse that sort of thing, as we handle 
"foo=bar,baz=42" style strings for extended options, so it shouldn't take 
much work to repurpose that. That said, I don't think passing options to 
tools in this way holds much value over specifying them in argv.
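
To make the key-value idea concrete, here is a minimal sketch in C. The 
names kv_emit() and kv_get() are made up for illustration and are not 
taken from gfs2-utils, and the sketch assumes that keys and values never 
contain ',', '=' or newline characters.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write one "key=value" record per line.
 * Returns whatever fprintf() returns (characters written, or < 0). */
int kv_emit(FILE *out, const char *key, const char *value)
{
    return fprintf(out, "%s=%s\n", key, value);
}

/* Hypothetical helper: look up 'key' in a "foo=bar,baz=42" style string
 * and copy its value into 'out'. Returns 0 on success, -1 if the key is
 * missing or the value does not fit into 'outlen' bytes. */
int kv_get(const char *opts, const char *key, char *out, size_t outlen)
{
    size_t klen = strlen(key);
    const char *p = opts;

    while (*p != '\0') {
        const char *end = strchr(p, ',');
        size_t len = end ? (size_t)(end - p) : strlen(p);

        /* An item matches only if it is exactly "key=" plus a value. */
        if (len > klen && strncmp(p, key, klen) == 0 && p[klen] == '=') {
            size_t vlen = len - klen - 1;

            if (vlen >= outlen)
                return -1;
            memcpy(out, p + klen + 1, vlen);
            out[vlen] = '\0';
            return 0;
        }
        p += len;
        if (*p == ',')
            p++;    /* skip the separator and try the next item */
    }
    return -1;
}
```

The same string format then works in both directions: a tool can print 
records with kv_emit() and a caller can pick values out with kv_get().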

Willing to budge for consensus, though.

Andy

> Now, to put you at ease,
> the answer to the obvious question "who will do it" is "me". I don't
> want to force you into anything, though, so I'm opening this
> discussion pretty early with some ideas, and I hope to hear what you
> think about it before anything is set in stone. (For
> those who visited Vault this year, Justin Mitchell and I had a talk
> about this, codename Springfield.)
> 
> The following text attempts to identify issues with using
> filesystem-related tools in scripts/applications and proposes a
> solution to those issues.
> 
> Content:
> 1. A quick introduction
> 2. Details of the issues
> 3. Proposed Solutions
> 4. Conclusion
> 
> 1. A quick introduction
> =================
> 
> I discussed this topic with people who are building something around
> fs tools. For example, the developer of libblockdev (Vratislav
> Podzimek, https://github.com/vpodzime/libblockdev) or system storage
> manager (was Lukas Czerner, now it is me,
> https://sourceforge.net/projects/storagemanager/), and the listed
> issues are a product of experience from working with those tools. The
> issues are related mostly to basic operations like mkfs, fsck,
> snapshots and resizing. Advanced/debugging tools like xfs_io or xfs_db
> are not in the focus, but they might benefit from any possible change
> too if included in it.
> 
> The main issues of the current state, where all the tools are run with
> some flags and options and produce a human readable output:
>   * The output format can change, sometimes without an easy way of
> detection like a version number.
>   * Different output formats for different tools even in a single FS,
> thus zero code reuse/sharing.
>   * Screenscraping can introduce bugs (a rare inner condition can add a
> field into the output and break regular expressions).
>   * No (or weak) progress report.
>   * Different filesystems have different input formats (flags, options)
> even for the most basic capabilities.
>   * Thread safety, forking
> 
> 2. Details of the issues
> ==================
> 
> Let’s look at each issue now: why it is a problem and whether some
> small change could fix it on its own.
> 
> 
> The output format can change, sometimes without an easy way of
> detection like a version number. Most filesystems are well behaved,
> but still, we don’t know what exactly people are doing with the tools
> and even adding a new field can possibly break a script. Keeping
> compatibility with older versions adds further complexity to such
> tools.
> 
> What can be done about this? The new fields have to be printed somehow
> and changing the format of the standard output would break everything.
> Making sure that if the input or output changes in any way, it is
> always with a detectable difference in the version number is a good
> practice, but it doesn’t solve the issue, it only makes hacking around
> it easier.
> 
> What can really help is to have an alternative output (which can be
> turned on when the user wants it), which is easy to parse and which is
> resilient to a certain degree of changes because it can express
> dynamic items like lists or arrays: JSON, XML...
> 
> 
> Different input/output formats for different tools even in a single
> FS, thus zero code reuse/sharing: Support for every tool and every
> filesystem has to start from scratch. Nothing can be done about it
> without a change in the output format. But if an optional JSON or XML
> output (or something similar) were supported, then instead of creating
> a parser for every tool, a single standard, already well-tested
> library could be used.
> 
> 
> Screenscraping can introduce bugs (some rare inner condition can add a
> field into the output and break regular expressions): Well, let’s just
> look at how many services still can’t even parse and verify an email
> address correctly. And we have a lot more complex text… Again, some
> easy-to-parse format with existing libraries that would turn the text
> into a bunch of variables or an object would help.
> 
> 
> No (or weak) progress report: Especially for tools that can run for a
> long time, like fsck. Screenscraping everything it throws out and then
> deciding whether it is a progress report message (because instead of
> “25 %” it says “running operation foo”), a message to tell the user,
> or something to just ignore is a lot less comfortable and more
> error-prone than “{level: ‘progress’, stage: 5, msg: ‘operation foo’}”.
> 
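
A structured progress record like the one quoted above could be produced 
along these lines. This is only a sketch: format_progress() is a 
hypothetical name, the field names simply follow the quoted example, and 
only '"' and '\' are escaped (control characters in the message are not 
handled).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format one progress record as a single JSON line.
 * Returns the length of the line, or -1 if 'buf' is too small (in which
 * case the contents of 'buf' are unspecified). */
int format_progress(char *buf, size_t len, int stage, const char *msg)
{
    int n = snprintf(buf, len,
                     "{\"level\": \"progress\", \"stage\": %d, \"msg\": \"",
                     stage);
    if (n < 0 || (size_t)n >= len)
        return -1;

    size_t pos = (size_t)n;
    for (const char *p = msg; *p != '\0'; p++) {
        /* Escape the two characters that would break the JSON string. */
        if (*p == '"' || *p == '\\') {
            if (pos + 1 >= len)
                return -1;
            buf[pos++] = '\\';
        }
        if (pos + 1 >= len)
            return -1;
        buf[pos++] = *p;
    }

    if (pos + 4 > len)      /* room for "}\n plus the terminating NUL */
        return -1;
    memcpy(buf + pos, "\"}\n", 4);
    return (int)(pos + 3);
}
```

A caller reading the tool's output then only has to check the "level" 
field to tell progress records apart from other messages.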
> 
> Different filesystems have different input formats (flags, options)
> even for the most basic capabilities: Similar to “Different
> input/output formats for different tools...”. For example, for
> labeling a partition, you use mkfs.xfs -L label, but mkfs.vfat -n
> label. However, changing this requires getting the same functionality
> with a common basic specification to other filesystems too.
> 
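
Until the flags themselves are unified, a caller has to carry a lookup 
like the following. A small sketch with a hypothetical label_flag() 
helper that only knows the two cases named in the quoted text:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: map a filesystem type to the mkfs flag that sets
 * the label. Only the two cases mentioned in the text are covered. */
const char *label_flag(const char *fstype)
{
    if (strcmp(fstype, "xfs") == 0)
        return "-L";    /* mkfs.xfs -L label */
    if (strcmp(fstype, "vfat") == 0)
        return "-n";    /* mkfs.vfat -n label */
    return NULL;        /* not in this table: the caller must decide */
}
```

Every tool built on top of the fs utilities ends up maintaining a table 
of this kind, which is the duplication a common specification would remove.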
> 
> Thread safety, forking: The people who work on a library, like
> libblockdev, don’t like that they have to fork another process over
> which they have no control, as they can’t guarantee anything about it.
> This can’t be fixed by changing the output format, though; it would
> require making a public library providing an interface similar to the
> existing fs tools. No detailed access to internals is needed, just a
> way to run mkfs, fsck, snapshots, etc… without spawning another
> process and without screenscraping.
> 
> 
> 3. Proposed Solutions
> =================
> 
> There are two (complementary) ways to address the issues: add a
> structured, machine-readable format for input/output of the tools
> (e.g. JSON, XML, …) and create a library with the functionality of
> the existing tools. Let’s look now at those options. I will focus on
> what changes they would require, what would be the price for
> maintaining that solution and if there are any drawbacks or additional
> advantages.
> 
> A possible third option would be to create a system service/daemon,
> that would use dbus or some other structured interface to accept jobs
> and return results. I think that LVM people are working on something
> like this.
> 
> The proposed solutions are ordered by complexity.
> Also, they can be seen as follow-ups, because every proposed option
> requires a big part of the work from the previous one anyway.
> 
> 
> 3.1. Structured Output
> -------------------------------
> In other words, what LVM already does with --reportformat
> {basic|json}, and lsblk with --json. Possibly, we could also accept
> JSON input. That would allow the user to, instead of using all the
> flags and options of the CLI, pass something like
> --jsoninput=“{dev:’/dev/abc’, force: true, … }”
> 
> Some preliminary notes about the format:
> Most likely, this would mean JSON. JSON is currently preferred over
> XML because it is easier for humans to read if the need arises, and its
> encoder/parser is lighter and easier to use. Also, other projects like
> LVM, multipath or lsblk already use JSON, so it would be nice not to
> break ranks.
> 
> Required implementation changes/expected problems:
> In an ideal world, a simple replacement of all prints with a wrapping
> function would be enough. However, as far as I know, an overwhelming
> majority of the tools have printing functions spread through the code
> and print everything as soon as they know the specific bit of
> information.
> 
> Change of the output into a structured format means some refactoring.
> Instead of simple printf(), an object, array or structure has to be
> created and rather than pure strings, more diagnostically useful
> values have to be added into it in place of the current prints. Then,
> when it is reasonable, print it out all at once in any desired format.
> The “when reasonable” would usually mean at the end, but it could also
> print progress if it is a long-running operation.
> 
> Because of the kinds of the required changes, the implementation can
> be split into two parts: first, clean the code and move all the prints
> out from spaghetti to the end. Then, once this is done, add the
> structured format.
> 
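
The split described above (collect the data first, format it in one 
place) might look roughly like this. It is a sketch under assumptions: 
struct mkfs_report, its fields, report_format_from_name() and 
report_render() are invented for illustration, and the device path is 
assumed to need no JSON escaping.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record that the tool fills in while it works. */
struct mkfs_report {
    const char *device;
    unsigned long long blocks;
    unsigned int block_size;
};

enum report_format { REPORT_TEXT, REPORT_JSON };

/* Map a hypothetical format option value to a format; text is the default. */
enum report_format report_format_from_name(const char *name)
{
    return (name != NULL && strcmp(name, "json") == 0) ? REPORT_JSON
                                                       : REPORT_TEXT;
}

/* The only function that knows about output formats. Returns the number
 * of characters written, or -1 if 'buf' is too small. */
int report_render(char *buf, size_t len, const struct mkfs_report *r,
                  enum report_format fmt)
{
    int n;

    if (fmt == REPORT_JSON)
        n = snprintf(buf, len,
                     "{\"device\": \"%s\", \"blocks\": %llu, \"block_size\": %u}\n",
                     r->device, r->blocks, r->block_size);
    else
        n = snprintf(buf, len, "%s: %llu blocks of %u bytes\n",
                     r->device, r->blocks, r->block_size);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

Adding a field then means touching the struct and report_render() only; 
the code that computes the values never sees the output format.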
> Maintenance overhead:
> Small. By separating the printing from generating the data, we don’t
> really care about the output format anywhere except in the printing
> function itself, and if a new field or value is added or an old one
> removed, then the amount of work is roughly equal to the current
> state.
> 
> Drawbacks:
> Searching for a place in the code where something happens would be
> more complicated - instead of simple search for a string that is the
> same in the code and in the output, one would search for the line that
> adds the data to the message log. This could be simplified by using
> the __LINE__ macro with debug output. (With JSON, an additional field
> would not affect anything, so it would be completely safe.)
> 
> Additional advantages:
> The refactoring can clean up the code a bit. It is easy to add any
> other format in the future. Our own test suite could also benefit from
> the structured output.
> 
> Comment:
> This option gives the most bang for the buck, and most of the work
> done for it is also required by every other option. In terms of the
> specific interface (i.e. common JSON elements), we need to identify a
> common subset of fields/options that every fs will use and move
> anything else into fs-specific extensions.
> 
> With regard to compatibility between different filesystems, the
> best way to specify the format might be a small library that would
> take raw data on one side and turn it into a JSON string or even print
> it (and the reverse, if input supports JSON too). This way, we would
> be sure that there really is a common ground that works the same in
> every fs.
> 
> Another way to achieve compatibility is to write an RFC-like
> document. For example: All occurrences of a filesystem identifier MUST
> be in a field named 'fsid' which SHOULD contain a UUID formatted
> string. I think this is useful even if we end up with the library, as a
> way to find out the common ground.
> 
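
A rule such as the 'fsid' example is easy to check mechanically. A small 
sketch, assuming "UUID formatted" means the usual 8-4-4-4-12 hexadecimal 
layout; is_uuid_string() is a hypothetical name:

```c
#include <ctype.h>
#include <string.h>

/* Return 1 if 's' has the 8-4-4-4-12 hex-digit layout, 0 otherwise. */
int is_uuid_string(const char *s)
{
    if (strlen(s) != 36)
        return 0;

    for (size_t i = 0; i < 36; i++) {
        if (i == 8 || i == 13 || i == 18 || i == 23) {
            if (s[i] != '-')
                return 0;
        } else if (!isxdigit((unsigned char)s[i])) {
            return 0;
        }
    }
    return 1;
}
```

Because the quoted rule says SHOULD rather than MUST, a consumer would 
probably treat a failed check as a warning and not as a hard error.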
> 3.2. A Library
> -------------------
> A library implementing basic functions like: mkfs(int argc, char
> *argv[]), fsck(), … etc. Once done, bindings for other languages like
> Python are possible too.
> 
> Required implementation changes/expected problems:
> If the implementation of this library followed the changes that add
> the structured output, then most of the work would already be done.
> The most complex remaining issue would probably be that there can be
> no exit() call - if anything fails, we have to gracefully return up
> the stack and out of the library.
> 
> Duplication of functionality is not an issue because there is none -
> the binary tools like mkfs.xfs would become simple wrappers around the
> library functions: pass user input to the specified library call, then
> print any message created in the library and exit.
> 
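
The "no exit() in the library, thin wrapper in the tool" arrangement 
could be sketched as follows. All names here (struct fs_result, 
fslib_mkfs(), tool_main()) are hypothetical, and the actual formatting 
work is left out.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical result object: the library records its message here
 * instead of printing it. */
struct fs_result {
    int code;           /* 0 on success, negative errno value on failure */
    char msg[128];
};

/* Hypothetical library entry point. It never calls exit(); every
 * failure is reported to the caller through the return value. */
int fslib_mkfs(const char *device, struct fs_result *res)
{
    if (device == NULL || strlen(device) == 0) {
        res->code = -EINVAL;
        snprintf(res->msg, sizeof(res->msg), "no device specified");
        return res->code;
    }

    /* The real formatting work would happen here. */
    res->code = 0;
    snprintf(res->msg, sizeof(res->msg), "formatted %s", device);
    return 0;
}

/* What the body of a command-line wrapper might reduce to: pass the
 * input on, print the message, turn the result into an exit status. */
int tool_main(int argc, char *argv[])
{
    struct fs_result res;
    int ret = fslib_mkfs(argc > 1 ? argv[1] : NULL, &res);

    fprintf(ret == 0 ? stdout : stderr, "%s\n", res.msg);
    return ret == 0 ? 0 : 1;
}
```

Only the wrapper decides what gets printed and what the process exit 
status is; a caller that links the library gets the same information 
without forking.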
> Maintenance overhead:
> None? The code would be cleaned up and moved, but there wouldn’t be
> new things to maintain.
> 
> Drawbacks:
> I can’t think of any...
> 
> Additional advantages:
> The refactoring can clean up the code a bit.
> 
> Comment:
> Useful and nice to have, but doesn’t have to be done ASAP.
> 
> 
> 3.3. A system service
> ------------------------------
> A system service/daemon, that would use dbus or some other structured
> interface to accept jobs and return results.
> 
> Required implementation changes/expected problems:
> We don’t want the daemon to exit if it can’t access a file, we don’t
> want it to call printf(), … So, in the end, we have to do the
> structured output and the library and then a lot of work on top of it.
> 
> Maintenance overhead:
> All the system services things plus whatever the other two solutions require.
> 
> Drawbacks:
> The biggest maintenance overhead of all proposed solutions, basically
> a new tool/subproject. Using dbus in a third party project is more
> work than just including a library and calling one function.
> 
> Additional advantages:
> With a single daemon handling all jobs, concurrent modifications of a
> device could be prevented.
> 
> Comment:
> In my opinion, this shouldn’t be our project. Let other people build it
> as their front end if they want something like this; we should instead
> just make it easier for them by providing one of the other solutions.
> 
> 
> 4. Conclusion
> ===========
> 
> A structured output is something we should aim for. It helps a lot
> with the issues and it is the cheapest option. If it goes well, it can
> be followed later on by creating a library, although, at this moment,
> that seems a premature thing to strive for. Creating a daemon is not a
> thing any single filesystem should do on its own.
> 
> And thank you for reading all this, I look forward to your comments. :-)
> Jan
>