* SPDX in the kernel: State of the union
@ 2022-05-17 23:31 Thomas Gleixner
  2022-05-18 13:42 ` Allison Randal
  0 siblings, 1 reply; 5+ messages in thread
From: Thomas Gleixner @ 2022-05-17 23:31 UTC (permalink / raw)
  To: linux-spdx

Folks!

After the initial SPDX effort, which ended about three years ago, there
has not been much progress, either in terms of file statistics or in
terms of activity on this list... I'm refraining from asking the obvious
questions...

Nevertheless I'm trying to cut myself some cycles to get this rolling
again.

As a first step I tried to resurrect my old scripts. That was not really an
enjoyable experience due to the python2 -> python3 fallout and the changes
in scancode since then.

Though after quite some cursing I was able to gather at least initial
statistics and to analyze patches based on the scancode detection rules.

I surely have to say quite some words about the 'improved' scancode
detection rules too, but I'll sort that out with Philippe off-list.

So here is where we are:

Files without SPDX identifier:		16410	~22% of total files

Files without any license hint:	         7131   ~43% of !SPDX'ed files
Files with one license hint:		 6673   ~40% of !SPDX'ed files
Files with two license hints:            2267   ~13% of !SPDX'ed files
Files with more than two hints:           339   ~ 2% of !SPDX'ed files

Files with less than 4 lines content:

        0 length:	   33   (some can be removed)
	1 line:		  276
	2 lines:	  109
	3 lines:	  135
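
The SPDX and short-file counts above can be approximated without scancode.
What follows is a minimal sketch, not the scripts used to produce the
numbers in this mail. It assumes that a file counts as SPDX'ed when an
`SPDX-License-Identifier:` tag shows up within its first few lines. The
names has_spdx_tag(), line_count() and survey() and the max_lines cutoff
are made up for illustration.

```python
import os
import re
from collections import Counter

# Assumption: the tag sits near the top of the file.
SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*\S+")


def has_spdx_tag(path, max_lines=5):
    """Return True if one of the first max_lines lines carries an SPDX tag."""
    try:
        with open(path, "r", encoding="utf-8", errors="replace") as f:
            for _, line in zip(range(max_lines), f):
                if SPDX_TAG.search(line):
                    return True
    except OSError:
        return False
    return False


def line_count(path):
    """Count lines without decoding, so binary files do not raise."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def survey(root):
    """Return (untagged files per top-level directory, short untagged files by line count)."""
    missing_by_dir = Counter()
    short_files = Counter()
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip the git metadata directory.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or has_spdx_tag(path):
                continue
            rel = os.path.relpath(path, root)
            top = rel.split(os.sep)[0] if os.sep in rel else "."
            missing_by_dir[top] += 1
            n = line_count(path)
            if n < 4:
                short_files[n] += 1
    return missing_by_dir, short_files
```

Calling survey() on a kernel checkout would yield per-directory counts of
untagged files and a histogram of the zero to three line files among them.
It says nothing about license hints, which is what scancode is for.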

Files without any license hint:

        arch                 774
	block		       1
	certs		       2
	crypto		      10
	Documentation	    4266
	drivers		     320
	fs		      26
	include		     124
	init		       0
	ipc		       0
	kernel		      14
	lib		      26
	mm		       3
	net		      15
	samples		       7
	scripts		      63
	security	       8
	sound		       9
	tools		    1457
	usr		       0
	virt		       0

Files with one license hint:

        arch		    1405
	block		       0
	certs		       1
	crypto		       1
	Documentation	      65
	drivers		    4369
	fs		     126
	include		     356
	init		       0
	ipc		       1
	kernel		      18
	lib		      35
	mm		       4
	net		      69
	samples		      14
	scripts		      26
	security	       0
	sound		      40
	tools		     141
	usr		       1
	virt		       0

Files with two license hints:

        arch		     731
	block		       0
	certs		       0
	crypto		       3
	Documentation	      13
	drivers		    1114
	fs		      66
	include		     101
	init		       0
	ipc		       0
	kernel		       0
	lib		      54
	mm		       0
	net		      91
	samples		      39
	scripts		       5
	security	       1
	sound		      14
	tools		      35
	usr		       0
	virt		       0

Script-able files with reasonable effort:

	No hint:            6501 ~90% of no-hint files
	One hint:	    5129 ~76% of one-hint files
	Two hints:	     584 ~25% of two-hint files
	Total:		   12213 ~75% of !SPDX'ed files

	Remaining:          4197 ~5% of total files

Scancode rules involved:     561
Scancode rules validated:    117
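
The shares of the !SPDX'ed files quoted above follow directly from the
raw counts. A small sketch that recomputes them from the numbers in this
mail; the helper name share() is made up, and the mail rounds loosely, so
the computed values only need to land near the quoted ones.

```python
# Counts taken from the statistics in this mail.
NO_SPDX = 16410
HINTS = {
    "no hint": 7131,
    "one hint": 6673,
    "two hints": 2267,
    "more than two hints": 339,
}
SCRIPTABLE = {"no hint": 6501, "one hint": 5129, "two hints": 584}
SCRIPTABLE_TOTAL = 12213  # total as stated in the mail


def share(part, whole):
    """Percentage of whole that part represents."""
    return 100.0 * part / whole


# The four hint classes partition the files without an SPDX identifier.
assert sum(HINTS.values()) == NO_SPDX

for name, count in HINTS.items():
    print(f"{name:22s} {count:6d}  {share(count, NO_SPDX):5.1f}% of !SPDX'ed files")

for name, count in SCRIPTABLE.items():
    print(f"script-able, {name:10s} {count:6d}  {share(count, HINTS[name]):5.1f}% of class")

remaining = NO_SPDX - SCRIPTABLE_TOTAL
print(f"remaining after scripting: {remaining}")
```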

My plan is to focus on the 'low hanging' fruit of reasonably easy
script-able files first.

For the files with zero hints that requires a few questions to be answered
upfront:

   1) What's the approach for files with obviously not copyright-able
      content:

      - Files which just include other file[s] (one or two lines)

      - Files which have just a more or less useful comment why they
      	are otherwise empty (one to three lines)

      - Files which just contain a #define FOO and an include of
        another file to compile the included file with some other
	functionality (two or three lines)

   2) What's the approach for machine generated files:

      - Primarily kernel configuration files

   3) What's the approach for 'hidden' dot-files like .gitignore:

      Those files are just providing information to tools. The file format
      is defined by the tool (git, clang, coccinelle....) and the creative
      content is exactly zero...

   4) What's the approach for binary blobs or other files which cannot carry
      license information in the file itself?

Which is related to the discussion in this thread:

  https://lore.kernel.org/all/20220516101901.475557433@linutronix.de
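
The three shapes of obviously not copyright-able files listed under 1)
are regular enough to be picked out mechanically. A rough sketch,
assuming C-style sources and at most three lines per file. The function
name classify_trivial() and its return labels are made up, and the
heuristic says nothing about what the licensing answer should be.

```python
import re

INCLUDE = re.compile(r'^#\s*include\s*[<"][^>"]+[>"]$')
DEFINE = re.compile(r'^#\s*define\s+\w+')


def classify_trivial(text, max_lines=3):
    """Label very short files that hold only includes, defines or comments.

    Returns "empty", "include-only", "comment-only", "define-and-include",
    or None if the file is longer or contains anything else.
    """
    lines = text.splitlines()
    if len(lines) > max_lines:
        return None
    kinds = set()
    for line in lines:
        s = line.strip()
        if not s:
            continue
        if INCLUDE.match(s):
            kinds.add("include")
        elif DEFINE.match(s):
            kinds.add("define")
        elif s.startswith("//") or (s.startswith("/*") and s.endswith("*/")):
            kinds.add("comment")
        else:
            # Anything else is treated as real content.
            return None
    if not kinds:
        return "empty"
    if kinds == {"include"}:
        return "include-only"
    if kinds == {"comment"}:
        return "comment-only"
    if kinds == {"define", "include"}:
        return "define-and-include"
    return None
```

For example, classify_trivial('#define FOO\n#include "bar.c"\n') lands in
the "define-and-include" bucket, while any line of ordinary code makes the
function return None so that the file goes through normal review.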

The other question for these files with zero hints is which license to
choose. Sure you can argue that all files w/o any hint fall under the
project license, but especially the Documentation directory is interesting
as it's not clear for all of the various content what the preferred and
assumed license should be. That needs some thoughts and clarifications.
For the kernel code itself that's not a real question, but the tools
directory might need some care too.

For the files which have a licensing hint in whatever form, I think
resuming the work where we left off, i.e. mainly reviewing the patterns
grouped per scancode match rule, makes a lot of sense.

Based on my cursory validation of those patterns I'm confident that we can
reach a 95% coverage within a reasonable amount of time.

Finally here is another round of important questions:

  #1 Is there still interest to get this done? The silence on this list
     after the initial effort is deafening.

  #2 Are there still enough interested and competent people on this list to
     handle the legal questions?

  #3 Was there any progress on the outstanding questions on this list where
     discussion dried out almost 3 years ago?

I'm willing to pull the cart again, but if the interest and support stays
around zero, I surely have other things to do.

Thanks,

	Thomas


* Re: SPDX in the kernel: State of the union
  2022-05-17 23:31 SPDX in the kernel: State of the union Thomas Gleixner
@ 2022-05-18 13:42 ` Allison Randal
  2022-05-20 15:37   ` Thomas Gleixner
  0 siblings, 1 reply; 5+ messages in thread
From: Allison Randal @ 2022-05-18 13:42 UTC (permalink / raw)
  To: Thomas Gleixner, linux-spdx

On 5/17/22 7:31 PM, Thomas Gleixner wrote:
> 
> Finally here is another round of important questions:
> 
>   #1 Is there still interest to get this done? The silence on this list
>      after the initial effort is deafening.

This list had 210 messages in 2021, and 64 so far in 2022, which may be
silence compared to LKML, but is reasonably respectable ongoing traffic
for a small cleanup project.

I'm still reviewing all patches as they flow through this list. I
haven't been actively marking them as reviewed-by me, but I would raise
any problems I saw, and I've seen others raising problems.

So, yes, there's still interest, and if you want to start generating
more patches, I'll happily contribute to the review process.

I actually thought you just ran out of easily scriptable fixes, but it's
nice to hear that there's still substantially more we can do with
scancode rules.

>   #2 Are there still enough interested and competent people on this list to
>      handle the legal questions?

I think so, yes. If we've lost some of our reviewers, we can recruit new
ones.

With the auto-generated patches, you will probably need to rate-limit
like you did in 2019, since the tools can generate patches far more
rapidly than the humans can review them.

>   #3 Was there any progress on the outstanding questions on this list where
>      discussion dried out almost 3 years ago?

ISTR reaching conclusions on all the questions before, but if there are
some lingering, we can revive them. And, you raised good new questions too.

> I'm willing to pull the cart again, but if the interest and support stays
> around zero, I surely have other things to do.

If you have the time and energy to do another burst, go for it. I don't
know that we'll ever get to 100%, but every file we clean up is helpful,
so it's worth continuing.

Allison


* Re: SPDX in the kernel: State of the union
  2022-05-18 13:42 ` Allison Randal
@ 2022-05-20 15:37   ` Thomas Gleixner
  2022-05-22 15:17     ` Allison Randal
  0 siblings, 1 reply; 5+ messages in thread
From: Thomas Gleixner @ 2022-05-20 15:37 UTC (permalink / raw)
  To: Allison Randal, linux-spdx

On Wed, May 18 2022 at 09:42, Allison Randal wrote:
> On 5/17/22 7:31 PM, Thomas Gleixner wrote:
> I actually thought you just ran out of easily scriptable fixes, but it's
> nice to hear that there's still substantially more we can do with
> scancode rules.

I ran out of cycles :)

> With the auto-generated patches, you will probably need to rate-limit
> like you did in 2019, since the tools can generate patches far more
> rapidly than the humans can review them.

Sure.

> If you have the time and energy to do another burst, go for it. I don't
> know that we'll ever get to 100%, but every file we clean up is helpful,
> so it's worth continuing.

I started to get some structure into this mess. For the first step I
excluded the Documentation directory, except for files in it that
match rules applying to source files. I'll tend to the
Documentation directory in a separate step.

Then I categorized the remaining match rules into the following:

Nr   Category      Rules   Files affected
 1   GPLv2[+]        141             1607
 2   GPL unknown      84             1663
 3   MIT              28             3275
 4   GPLv2/MIT         2               36
 5   BSD              20              114
 6   GPL/BSD          32             1004
 7   ISC               4              343
 8   X11               1                3
 9   Other             9               50
10   Unclear          63              916
11   Unknown          78              321
12   Nasty            16               48
13   Bogus            21              861

#1 Pretty clear GPLv2[or later] and LGPL matches.

#2 The nasty 'under GPL' ones. Quite some of them reference COPYING

#3-9 Pretty clear matches for MIT/BSD/ISC/X11/ZLIB and GPL combos of
     those

#10 The unclear (at least to me) ones

#11 Licenses the kernel does not have (yet) in the LICENSES
    directory, but some of them are not really clear to me

#12 GPL version 1 and version 3, reiserfs and some proprietary

#13 A set of bogosities in scancode which I need to discuss
    with Philippe.

I probably made some mistakes here and there, but that's what I have
now.

I've generated static HTML pages from the data, which are available
here:

   https://tglx.de/~tglx/spdx/index.html

so you can get a taste of what is coming to you sooner rather than later.
The categories link to pages with rules and the rules to a per rule details
page. The latter has links to a Linux cross reference site in case you
want to look at the real thing instead of the 'normalized' match
patterns on the rule page.

My plan is to start with categories #1 and #3-9 and send out batches of
patches to the list.
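
To get a feel for the size of that first wave, the rule and file counts
of categories #1 and #3-9 can be added up from the table above. This is
only a sketch: it assumes that the per-category file counts may overlap
(a file can match more than one rule), so the file total is best read as
an upper bound. The names CATEGORIES and FIRST_WAVE are made up.

```python
# (name, rules, files affected) per category number, from the table above.
CATEGORIES = {
    1: ("GPLv2[+]", 141, 1607),
    2: ("GPL unknown", 84, 1663),
    3: ("MIT", 28, 3275),
    4: ("GPLv2/MIT", 2, 36),
    5: ("BSD", 20, 114),
    6: ("GPL/BSD", 32, 1004),
    7: ("ISC", 4, 343),
    8: ("X11", 1, 3),
    9: ("Other", 9, 50),
    10: ("Unclear", 63, 916),
    11: ("Unknown", 78, 321),
    12: ("Nasty", 16, 48),
    13: ("Bogus", 21, 861),
}

# Categories #1 and #3-9.
FIRST_WAVE = [1] + list(range(3, 10))

first_wave_rules = sum(CATEGORIES[n][1] for n in FIRST_WAVE)
# Upper bound, assuming files may be counted under more than one rule.
first_wave_files = sum(CATEGORIES[n][2] for n in FIRST_WAVE)

print(f"rules: {first_wave_rules}, files affected (at most): {first_wave_files}")
```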

Which size of batches and what rate do you folks prefer?

Thanks,

        tglx


* Re: SPDX in the kernel: State of the union
  2022-05-20 15:37   ` Thomas Gleixner
@ 2022-05-22 15:17     ` Allison Randal
  2022-05-22 17:35       ` Thomas Gleixner
  0 siblings, 1 reply; 5+ messages in thread
From: Allison Randal @ 2022-05-22 15:17 UTC (permalink / raw)
  To: Thomas Gleixner, linux-spdx

On 5/20/22 11:37 AM, Thomas Gleixner wrote:
> 
> I ran out of cycles :)

Nod, it happens. Really, it's a planet-wide phenomenon the past couple
of years. :)

> I started to get some structure into this mess.
[...]
> I've generated static HTML pages from the data, which are available
> here:
> 
>    https://tglx.de/~tglx/spdx/index.html

Makes sense, and a large number of them look like they'll be easy to
review and approve.

> Which size of batches and what rate do you folks prefer?

Looking back to 2019, you generally sent batches of 10-25 patches per
day, where each patch was one match rule. Seems reasonable to start
again there, and tune up or down as needed.
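
Splitting a queue of per-rule patches into daily batches is simple to
script. A minimal sketch, assuming one patch per match rule and a fixed
batch size somewhere in the 10-25 range. The name daily_batches() and the
rule names in the usage note are placeholders, not actual tooling.

```python
from itertools import islice


def daily_batches(patches, per_day=25):
    """Yield successive lists of at most per_day patches, one list per day."""
    if per_day < 1:
        raise ValueError("per_day must be at least 1")
    it = iter(patches)
    while True:
        batch = list(islice(it, per_day))
        if not batch:
            return
        yield batch
```

With 60 placeholder entries, list(daily_batches([f"rule-{i}" for i in
range(60)], per_day=25)) gives three batches of 25, 25 and 10 patches.
Tuning the rate up or down is then a matter of changing per_day between
sends.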

Allison


* Re: SPDX in the kernel: State of the union
  2022-05-22 15:17     ` Allison Randal
@ 2022-05-22 17:35       ` Thomas Gleixner
  0 siblings, 0 replies; 5+ messages in thread
From: Thomas Gleixner @ 2022-05-22 17:35 UTC (permalink / raw)
  To: Allison Randal, linux-spdx

On Sun, May 22 2022 at 11:17, Allison Randal wrote:
> On 5/20/22 11:37 AM, Thomas Gleixner wrote:
>> I've generated static HTML pages from the data, which are available
>> here:
>> 
>>    https://tglx.de/~tglx/spdx/index.html
>
> Makes sense, and a large number of them look like they'll be easy to
> review and approve.

I hope so.

>> Which size of batches and what rate do you folks prefer?
>
> Looking back to 2019, you generally sent batches of 10-25 patches per
> day, where each patch was one match rule. Seems reasonable to start
> again there, and tune up or down as needed.

Sounds like a plan.

Thanks,

        Thomas
