On Dec 10, 2020, at 21:04, NeilBrown > wrote: I've been exploring an alternate approach to preparing lustre code for upstream submission. Neil, sorry for not responding to this email in a timely manner. I'd read it when you first posted it, but then thought "I should take time to read it closely and reply in a thoughtful manner" but it got lost in my inbox. As you probably know, the approach we have been following is based on the code that was removed from 'drivers/staging' a couple of years ago. I reverted that removal patch and have ported all the relevant patches from lustre "master" across to that tree, as well as other cleanups and fixes. A problem with this approach is that I find that I don't trust it. I've found multiple examples of patches being ported across incorrectly, of patches being missed, and of bugs introduced while the code was in drivers/staging. I've found and fixed a lot of issues. I doubt that I've found them all. While I do some testing of this code, as does James, it is minimal compared to the testing done on lustre/master. And as we should all remember: untested code is buggy code. That was my concern with this approach also. When Greg KH was keeping it in the staging tree, it ended up getting drive-by patches from all kinds of developers that ended up adding bugs in patches that weren't reviewed or tested. Some time ago, we'd offered to run automated testing for the fs/linux-staging repo in our autotest, but that essentially requires running everything through Gerrit, which is not the preferred workflow for upstream kernels. People who use lustre don't only want correctness and performance, they want confidence, and I'm not confident of this code. My alternate approach is to use the code straight out of lustre/master, transformed to match upstream requirements by a relatively simple script. There is clearly still room for the script to introduce problems, but it is much less likely and any problems introduced are less likely to go unnoticed. I think you & James have been doing a great job cleaning up the code in master to be more in line with the upstream client, and I agree continuing in that direction (bringing the master into acceptable shape) is the only feasible way to move forward. I have written such a script, based largely on sed, unifdef, and spatch (the latter from the coccinelle project). Not all changes that I want can be achieved with these tools so I also have a collection of patches to lustre which hopefully will be submitted in due course. Many are the same sorts of cleanups that I've been submitting for a while. Others are a bit different. For example I add a lot more "#ifdef HAVE_SERVER_SUPPORT" preprocessor directives so that "unifdef" can strip out more server-only code. I've also added some "#ifndef UPSTREAM_LINUX" directives to remove code that cannot go upstream, like the code for extracting a jobid from the environment. I know Peng Tao had originally written such a script when he first pushed the Lustre code into the upstream kernel, but I don't think it was ever maintained after the initial push. That would definitely make it more clear what deltas exist between master and the upstream kernel versions. You can see my current lustre tree at https://github.com/neilbrown/lustre/tree/lustre-next This contains 119 patches on top of master, some of which have already been submitted to gerrit, others could be easily, others need a lot of work first. The last patch adds the "copy-client" script. We are (finally) getting close to releasing 2.14.0 (definitely later than we had hoped), and that will open up the gates again for the code cleanup patches to land. While there have definitely been some bugs introduced from those patches, like you wrote above they eventually are found with proper testing, and we benefit from removing many layers of cruft from the old code. In https://github.com/neilbrown/linux/tree/lustre/lustre-next you can find a very recent linux/master tree with two patches added. The first was auto-generated by the script from lustre/lustre-next. The second fixes some compile errors with fscrypt() (though it may actually break the code, I don't know). Note that the copied LNET code is TCP-only. I left out the o2ib code as I don't think we need that for the first code drop. Apart from providing the code mentioned above, this little project has also helped me form a clearer picture of what needs to be done before upstream submission is worth attempting. My list includes: - IPv6 support. I have a lot of patches which add support for extended NIDs which can store IPv6 addresses. There is more to be done here (and thanks to Andreas for his review which I haven't responded to yet). One awkwardness in the conversion that I haven't addressed at all yet is ptlrpc/nodemap_range.c. This allows a range of IP address to have a particular nodemap. This might make sense for IPv4, but makes somewhat less sense for IPv6. I'm storing IPv6 addresses in network-byte-order (which is the normal way to store addresses), and that make ranges particularly cumbersome. One idea is to use netmasks rather than ranges to define subsets, for IPv6 at least. Could you please file a ticket in Jira to track that issue. Definitely nodemaps have become important at some sites for management of multiple clients, so we don't want to lose that functionality. - fscrypt. There is a comment in lustre-core.m4 which suggests that lustre does not support file name encryption, and so cannot use the fscrypt support in Linux. If I understand that correctly, then this needs to be rectified. This is planned to be fixed in 2.15, it just wasn't ready in time for the feature cutoff for 2.14. The 2.14 release only has file data encryption. - I want to deprecate libcfs as there isn't anywhere convenient to put it - as a whole - in the linux kernel. It currently contains a mixture of things: = back-compat code, which Linux obviously doesn't needed - workitem.c and hash.c which are being replaced by workqueue and rhashables - cpt code, which is largely used by lnet, so can go in lnet/lnet - heap tree, which can go in lustre/ptlrpc as it is only used by nrs. - fail.c error injection. It would be nice to convert lustre to use the error-injection in linux/lib. This is a completely different user-space interface management. Is the current interface seen as a "stable api", or is it only used by lustre/tests (which can obviously be changed). There aren't (or shouldn't be) any users of fail_loc outside of the test framework. That said, we do need some kind of interoperability for testing new/old clients/servers together, so there can't be a wholesale removal of this functionality. - tracefile/debug. I think this should be changed to use tracepoints. I believe the capabilities are similar, though the details are different. I think I've raised this before and had an explanation of why people aren't keen. Maybe it is time to revisit that. LU-8980 was closed "Won't Fix" with no explanatory comment... I think the issue here was that the patches to wrap CDEBUG() with tracepoints was really unwieldy. On the flip side, there are a *lot* of CDEBUG() messages, and converting them all to tracepoints would be lots of work, since each tracepoint is essentially "hand rolled". Also, enabling/disabling CDEBUG is easy for users to do for collecting debug logs to analyze a problem, while the same can't be said for tracepoints. IMHO, I don't think there is a requirement to remove CDEBUG() to go upstream. Lots of other kernel components have similar debug macros to enable/disable messages in the kernel. The big difference is the ring buffer for tracefile. If we could use some other pre-existing ring buffer for that (which was under development at some point, but I don't know if it ever made it into the kernel) then we could drop the parts that I think raise objections. - libcfs_string - I've converted nearly all of this to use functionality already in Linux. Only cfs_str2mask() remains. That is mostly part of CDEBUG/tracefile and could be moved there. - adler32 - hopefully this can be copied to crypto/ in linux. Yes, the code is already in the kernel, it just isn't available via CryptoAPI. - move stuff out of /proc I understand that part of this is statistics, which don't fit in debugfs as that is root-only, and don't fit in /sys at that is one-value-per-file. Firstly I'd like to resolve everything that is not stats. James has a netlink proposal for stats. NFS uses /proc/self/mountstats to present stats info. IMHO, each filesystem/subsystem having its own proc-like virtual filesystem to export stats/tunables from the kernel is not a win over just using /proc (probably a net loss because each one implements the same boilerplate code to "avoid using /proc"), but that seems to have been what happened in the end. I plan to create some jira issues to cover: - moving code out of libcfs - fault injection - adding more HAVE_SERVER_SUPPORT and UPSTREAM_LINUX directives and start submitting some of the patches I have, and work on some more. I'll probably keep my existing linux-lustre tree (based on drivers/staging) up to date, as there is still valuable stuff in there to be ported across, and being able to 'diff' the two (with filtering) helps. Yes, keeping track of the deltas is definitely interesting for both tracking what may still be needed to port to master, as well as the off chance that there may be bugs fixed in the staging tree that weren't in master. Cheers, Andreas -- Andreas Dilger Principal Lustre Architect Whamcloud