linux-spdx.vger.kernel.org archive mirror
* Let the fun start
@ 2019-05-07 11:53 Thomas Gleixner
From: Thomas Gleixner @ 2019-05-07 11:53 UTC (permalink / raw)
  To: linux-spdx

Hi!

To get the work going I set up git repositories with tools and results. See
the document below.

As a follow-up I'm going to post the first few patches from step2 (GPL
boilerplate replacement) so you get an idea of what this looks like and we
can discuss how to proceed with review etc.

Thanks,

	tglx

Machine assisted license cleanup
--------------------------------

1. Tools for reproduction:

   1.1 scancode toolkit

       A license scanner tool which can be run from the command line and
       provides excellent parallelisation. While fast, it's recommended to
       run it on a machine with tons of CPUs and tons of memory.

       A run with 128 parallel scan threads takes about 15 minutes. Go
       figure how long it will take on your laptop :)

         https://github.com/nexB/scancode-toolkit

   1.2 spdx helper scripts

       A bunch of horrible python scripts with even more horrible shell
       glue.

         git://git.kernel.org/pub/scm/utils/spdx/spdx-utils

       gitweb URL:

         https://git.kernel.org/pub/scm/utils/spdx/spdx-utils.git

       The main workhorse is lcheck.py. I wrote it initially to gather
       statistics and other information, but over time it evolved into a
       Swiss army knife. lcheck.py --help gives you the gory details, no
       manpage, sorry.

   1.3 git

       The git tools must be available.

       A clean linux tree must be cloned. Ensure that there are no
       artifacts from editing, patch directories etc.
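
       A minimal sketch of how one could check that the tree is clean
       before starting. This is not part of spdx-utils; the function names
       are made up for illustration. It relies only on 'git status
       --porcelain --ignored', which prints nothing when there are no
       modified, untracked or ignored files:

```python
import subprocess


def is_clean(porcelain_output: str) -> bool:
    """Return True if 'git status --porcelain' reported no entries."""
    return porcelain_output.strip() == ""


def tree_is_clean(repo_path: str) -> bool:
    """Run git in repo_path and check for leftovers (hypothetical helper)."""
    result = subprocess.run(
        # --ignored also lists ignored files, e.g. build artifacts
        ["git", "-C", repo_path, "status", "--porcelain", "--ignored"],
        check=True,
        capture_output=True,
        text=True,
    )
    return is_clean(result.stdout)
```

       Usage would be something like tree_is_clean("path/to/linux/kernel").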

   To reproduce the setup (in case you have a big enough machine or
   lots of time for thumb twiddling):

    - Install scancode and git. If you need help with scancode, talk
      to Philippe.

    - Clone the linux kernel

    - Clone the spdx scripts

    - cd into the spdx scripts directory

    - invoke the runscript with:

      ./runall.sh path/to/linux/kernel

      The path can be relative or absolute

    - Wait ....

    - Check the results in the stepX directories

    - Check the results in the kernel directory (each step creates a
      branch).


   For your convenience:

     Besides the master branch, the spdx-utils repository contains a
     branch linux-5.1. It contains:

     - the scancode json files for each step
     - the stats.txt file for each step
     - the rules which are handled in each step
     - the resulting patches

    The resulting kernel tree is pushed to:

      git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-spdx.git

    Branches step1, step2, step3 contain the steps documented below.

    gitweb URL:

      https://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-spdx.git

2. Approach:

   The Documentation directory is ignored for now. That needs some extra
   care.

   2.1 Files with no license

       These files have not been touched during the first large sweep.

   2.1.1 Build files

         Make/Kconfig files without license information

   2.1.2 Source files which have only MODULE_LICENSE("GPL") and/or
         EXPORT_SYMBOL_GPL()

         Now that MODULE_LICENSE is clarified this can be tackled.

   The scripts identify these files in the scanner result and add the
   proper license identifier (GPL-2.0-only).

   The scripts generate patches which can be applied with quilt or imported
   into git with 'git quiltimport'
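
   To illustrate what adding the identifier means for a single file, here
   is a rough sketch. It is not the code from spdx-utils. The comment
   styles ('//' for .c, '/* */' for .h, '#' for Make/Kconfig files) are my
   reading of the kernel's license rules documentation, and the function
   names are hypothetical:

```python
SPDX_TAG = "SPDX-License-Identifier:"


def spdx_line(filename: str, license_id: str = "GPL-2.0-only") -> str:
    """Build the identifier line in the comment style assumed for the file type."""
    if filename.endswith(".c"):
        return f"// {SPDX_TAG} {license_id}\n"
    if filename.endswith(".h"):
        return f"/* {SPDX_TAG} {license_id} */\n"
    # Assumption: everything else (Makefile, Kconfig, scripts) uses '#'
    return f"# {SPDX_TAG} {license_id}\n"


def add_spdx(text: str, filename: str, license_id: str = "GPL-2.0-only") -> str:
    """Insert the identifier as the first line, or after a '#!' line."""
    lines = text.splitlines(keepends=True)
    # Leave files alone which already carry an identifier at the top
    if any(SPDX_TAG in line for line in lines[:2]):
        return text
    pos = 1 if lines and lines[0].startswith("#!") else 0
    lines.insert(pos, spdx_line(filename, license_id))
    return "".join(lines)
```

   The early return keeps the operation idempotent, so running it twice
   over the same file does not stack up identifiers.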

   SPDX count goes from 22574 to 25712 (44.9%)

   2.2 Files with a single license: GPL-2.0-only or GPL-2.0-or-later

       The scripts handle the following tasks:

       - Find the affected files in the scanner output

       - Generate a list of match rules which represent a unique pattern.
         This is achieved by normalizing the texts (removing formatting,
         white space damage, uppercase / lowercase and punctuation damage).

       - Add the appropriate license header and remove the boilerplate
         text or the license reference.

       - Create a patch series. Each patch contains only the modifications
         for a single match rule. The rule (and possible variants)
         are saved in the change log of each patch to ease review.

       - Once a reference dataset (compliance data provided by Siemens) is
         available the scripts will also check for conflicts with that
         data set.
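
       The normalization idea can be sketched roughly like this. It is a
       simplified illustration and not what lcheck.py actually does; the
       function names and the exact normalization steps are assumptions.
       Notices which differ only in comment markers, case, punctuation or
       white space end up under the same key, which stands in for one
       match rule:

```python
import re
from collections import defaultdict


def normalize(text: str) -> str:
    """Reduce a license notice to lowercase words separated by one space."""
    text = text.lower()
    # Replace punctuation and comment markers with spaces
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse all white space, including line breaks
    return " ".join(text.split())


def group_by_rule(notices: dict) -> dict:
    """Map each normalized notice to the list of files which carry it."""
    groups = defaultdict(list)
    for path, notice in notices.items():
        groups[normalize(notice)].append(path)
    return dict(groups)
```

       Each key of the result would then correspond to one patch in the
       series, touching exactly the files listed under it.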

       This results in 515 patches at the moment.

       The scripts generate patches which can be applied with quilt or
       imported into git with 'git quiltimport'

       SPDX count goes from 25712 to 46368 (80.7%)
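
       For reference, a simple way to arrive at such a count. This is only
       a sketch: how lcheck.py counts and which files go into the
       percentage is not spelled out here, so looking at the first five
       lines of every file outside .git is an assumption, as is the
       function name:

```python
import os


def count_spdx(root: str, max_lines: int = 5):
    """Return (files with an SPDX identifier near the top, total files)."""
    tagged = total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into the git metadata
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            total += 1
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", encoding="utf-8", errors="ignore") as f:
                    head = [f.readline() for _ in range(max_lines)]
            except OSError:
                # Unreadable files count as untagged
                continue
            if any("SPDX-License-Identifier:" in line for line in head):
                tagged += 1
    return tagged, total
```

       The percentage is then 100.0 * tagged / total for whatever set of
       files is taken into account.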

   2.3 Files with GPL-2.0-only/or-later and Linux-OpenIB

       Basically the same as above just with dual licensing.
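
       For dual licensed files the identifier becomes an SPDX expression
       which joins the alternatives with OR. A tiny sketch (the function
       name is hypothetical and the '//' comment style assumes a .c file):

```python
def spdx_expression(license_ids) -> str:
    """Join alternative licenses into an SPDX 'OR' expression."""
    return " OR ".join(license_ids)


def c_source_header(license_ids) -> str:
    """Build the identifier line for a .c file (assumed '//' comment style)."""
    return f"// SPDX-License-Identifier: {spdx_expression(license_ids)}"
```

       For example, c_source_header(["GPL-2.0-only", "Linux-OpenIB"]) gives
       "// SPDX-License-Identifier: GPL-2.0-only OR Linux-OpenIB".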

       SPDX count goes from 46368 to 46865 (81.9%)

   2.4 More fun later :)

       I have quite a bunch of steps in preparation, but let's get the above
       agreed on and reviewed first.

