On Dec 10, 2020, at 21:04, NeilBrown <neilb@suse.de<mailto:neilb@suse.de>> wrote:
I've been exploring an alternate approach to preparing lustre code for
upstream submission.

Neil,
sorry for not responding to this email in a timely manner.  I'd read it when
you first posted it, but then thought "I should take time to read it closely
and reply in a thoughtful manner" but it got lost in my inbox.

As you probably know, the approach we have been following is based on
the code that was removed from 'drivers/staging' a couple of years ago.
I reverted that removal patch and have ported all the relevant patches
from lustre "master" across to that tree, as well as other cleanups and
fixes.

A problem with this approach is that I find that I don't trust it.
I've found multiple examples of patches being ported across
incorrectly, of patches being missed, and of bugs introduced while the
code was in drivers/staging.
I've found and fixed a lot of issues.  I doubt that I've found them
all.
While I do some testing of this code, as does James, it is minimal
compared to the testing done on lustre/master.  And as we should all
remember: untested code is buggy code.

That was my concern with this approach also.  When Greg KH was keeping
it in the staging tree, it ended up getting drive-by patches from all kinds of
developers that ended up adding bugs in patches that weren't reviewed or
tested.

Some time ago, we'd offered to run automated testing for the fs/linux-staging
repo in our autotest, but that essentially requires running everything through
Gerrit, which is not the preferred workflow for upstream kernels.

People who use lustre don't only want correctness and performance, they
want confidence, and I'm not confident of this code.

My alternate approach is to use the code straight out of lustre/master,
transformed to match upstream requirements by a relatively simple
script.  There is clearly still room for the script to introduce
problems, but it is much less likely and any problems introduced are
less likely to go unnoticed.

I think you & James have been doing a great job cleaning up the code in
master to be more in line with the upstream client, and I agree continuing
in that direction (bringing the master into acceptable shape) is the only
feasible way to move forward.

I have written such a script, based largely on sed, unifdef, and spatch
(the latter from the coccinelle project).

Not all changes that I want can be achieved with these tools so I also
have a collection of patches to lustre which hopefully will be
submitted in due course.  Many are the same sorts of cleanups that I've
been submitting for a while.  Others are a bit different.  For example
I add a lot more "#ifdef HAVE_SERVER_SUPPORT" preprocessor directives
so that "unifdef" can strip out more server-only code.  I've also added
some "#ifndef UPSTREAM_LINUX" directives to remove code that cannot go
upstream, like the code for extracting a jobid from the environment.

I know Peng Tao had originally written such a script when he first pushed
the Lustre code into the upstream kernel, but I don't think it was ever
maintained after the initial push.  That would definitely make it more clear
what deltas exist between master and the upstream kernel versions.

You can see my current lustre tree at
  https://github.com/neilbrown/lustre/tree/lustre-next
This contains 119 patches on top of master, some of which have already
been submitted to gerrit, others could be easily, others need a lot of
work first. The last patch adds the "copy-client" script.

We are (finally) getting close to releasing 2.14.0 (definitely later than we had
hoped), and that will open up the gates again for the code cleanup patches
to land.  While there have definitely been some bugs introduced from those
patches, like you wrote above they eventually are found with proper testing,
and we benefit from removing many layers of cruft from the old code.

In
  https://github.com/neilbrown/linux/tree/lustre/lustre-next
you can find a very recent linux/master tree with two patches added.
The first was auto-generated by the script from lustre/lustre-next.
The second fixes some compile errors with fscrypt() (though it may
actually break the code, I don't know).

Note that the copied LNET code is TCP-only.  I left out the o2ib code
as I don't think we need that for the first code drop.

Apart from providing the code mentioned above, this little project has
also helped me form a clearer picture of what needs to be done before
upstream submission is worth attempting.  My list includes:

- IPv6 support.  I have a lot of patches which add support for extended
  NIDs which can store IPv6 addresses.  There is more to be done here
  (and thanks to Andreas for his review which I haven't responded to
  yet).
  One awkwardness in the conversion that I haven't addressed at all yet
  is ptlrpc/nodemap_range.c.  This allows a range of IP address to have
  a particular nodemap.  This might make sense for IPv4, but makes
  somewhat less sense for IPv6.
  I'm storing IPv6 addresses in network-byte-order (which is the normal
  way to store addresses), and that make ranges particularly cumbersome.
  One idea is to use netmasks rather than ranges to define subsets, for
  IPv6 at least.

Could you please file a ticket in Jira to track that issue.  Definitely nodemaps
have become important at some sites for management of multiple clients,
so we don't want to lose that functionality.

- fscrypt.  There is a comment in lustre-core.m4 which suggests that
   lustre does not support file name encryption, and so cannot use
   the fscrypt support in Linux.  If I understand that correctly, then
   this needs to be rectified.

This is planned to be fixed in 2.15, it just wasn't ready in time for the feature
cutoff for 2.14.  The 2.14 release only has file data encryption.

- I want to deprecate libcfs as there isn't anywhere convenient to put
  it - as a whole - in the linux kernel. It currently contains a
  mixture of things:
  = back-compat code, which Linux obviously doesn't needed

  - workitem.c and hash.c which are being replaced by
    workqueue and rhashables

  - cpt code, which is largely used by lnet, so can go in lnet/lnet

  - heap tree, which can go in lustre/ptlrpc as it is only used by nrs.

  - fail.c error injection.  It would be nice to convert lustre to
    use the error-injection in linux/lib.  This is a completely
    different user-space interface management.
    Is the current interface seen as a "stable api", or is it only
    used by lustre/tests (which can obviously be changed).

There aren't (or shouldn't be) any users of fail_loc outside of the test
framework.  That said, we do need some kind of interoperability for
testing new/old clients/servers together, so there can't be a wholesale
removal of this functionality.

  - tracefile/debug.  I think this should be changed to use
    tracepoints. I believe the capabilities are similar, though the
    details are different.  I think I've raised this before and had an
    explanation of why people aren't keen.  Maybe it is time to revisit
    that.
    LU-8980 was closed "Won't Fix" with no explanatory comment...

I think the issue here was that the patches to wrap CDEBUG() with
tracepoints was really unwieldy.  On the flip side, there are a *lot* of
CDEBUG() messages, and converting them all to tracepoints would
be lots of work, since each tracepoint is essentially "hand rolled".
Also, enabling/disabling CDEBUG is easy for users to do for collecting
debug logs to analyze a problem, while the same can't be said for
tracepoints.

IMHO, I don't think there is a requirement to remove CDEBUG() to go
upstream.  Lots of other kernel components have similar debug macros
to enable/disable messages in the kernel.  The big difference is the ring
buffer for tracefile.  If we could use some other pre-existing ring buffer
for that (which was under development at some point, but I don't know
if it ever made it into the kernel) then we could drop the parts that I
think raise objections.

  - libcfs_string - I've converted nearly all of this to use
    functionality already in Linux.  Only cfs_str2mask() remains.

That is mostly part of CDEBUG/tracefile and could be moved there.

  - adler32 - hopefully this can be copied to crypto/ in linux.

Yes, the code is already in the kernel, it just isn't available via CryptoAPI.

- move stuff out of /proc
   I understand that part of this is statistics, which don't fit in
   debugfs as that is root-only, and don't fit in /sys at that is
   one-value-per-file.
   Firstly I'd like to resolve everything that is not stats.
   James has a netlink proposal for stats.
   NFS uses /proc/self/mountstats to present stats info.

IMHO, each filesystem/subsystem having its own proc-like virtual
filesystem to export stats/tunables from the kernel is not a win over just
using /proc (probably a net loss because each one implements the same
boilerplate code to "avoid using /proc"), but that seems to have been what
happened in the end.

I plan to create some jira issues to cover:
 - moving code out of libcfs
 - fault injection
 - adding more HAVE_SERVER_SUPPORT and UPSTREAM_LINUX directives

and start submitting some of the patches I have, and work on some more.

I'll probably keep my existing linux-lustre tree (based on
drivers/staging) up to date, as there is still valuable stuff in there
to be ported across, and being able to 'diff' the two (with filtering)
helps.

Yes, keeping track of the deltas is definitely interesting for both tracking what may
still be needed to port to master, as well as the off chance that there may be bugs
fixed in the staging tree that weren't in master.

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud