* SPDX in the kernel: State of the union
@ 2022-05-17 23:31 Thomas Gleixner
2022-05-18 13:42 ` Allison Randal
0 siblings, 1 reply; 5+ messages in thread
From: Thomas Gleixner @ 2022-05-17 23:31 UTC (permalink / raw)
To: linux-spdx
Folks!
After the initial SPDX effort, which ended about three years ago, there
has not been much progress, neither in terms of file statistics nor in
terms of activity on this list... I'm refraining from asking the obvious
questions...
Nevertheless I'm trying to carve out some cycles to get this rolling
again.
As a first step I tried to resurrect my old scripts. That was not really an
enjoyable experience due to the python2 -> python3 fallout and the changes
in scancode since then.
Though after quite some cursing I was able to gather at least initial
statistics and to analyze patches based on the scancode detection rules.
I surely have to say quite some words about the 'improved' scancode
detection rules too, but I'll sort that out with Philippe off-list.
So here is where we are:
Files without SPDX identifier:     16410  ~22% of total files
  Files without any license hint:   7131  ~43% of !SPDX'ed files
  Files with one license hint:      6673  ~40% of !SPDX'ed files
  Files with two license hints:     2267  ~13% of !SPDX'ed files
  Files with more than two hints:    339  ~ 2% of !SPDX'ed files
Files with fewer than 4 lines of content:
  0 length:   33  (some can be removed)
  1 line:    276
  2 lines:   109
  3 lines:   135
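Numbers of this kind can be approximated with a short scan of the tree. The following is only a minimal sketch and not the scripts used for the statistics above: it assumes a file counts as tagged when an `SPDX-License-Identifier:` line shows up within its first few lines, and the names `HEAD_LINES`, `has_spdx_tag` and `survey` are made up for illustration.

```python
import os
import re

# Assumption: the tag has to appear near the top of the file.
SPDX_RE = re.compile(rb"SPDX-License-Identifier:\s*\S+")
HEAD_LINES = 5  # hypothetical cutoff, not taken from the original scripts


def has_spdx_tag(path):
    """Return True if one of the first HEAD_LINES lines carries an SPDX tag."""
    with open(path, "rb") as f:
        for _, line in zip(range(HEAD_LINES), f):
            if SPDX_RE.search(line):
                return True
    return False


def line_count(path):
    """Count the lines of a file; an empty file yields 0."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def survey(root):
    """Walk root and return (untagged paths, counts of 0..3 line files)."""
    untagged = []
    buckets = {0: 0, 1: 0, 2: 0, 3: 0}
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip the git metadata directory.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or has_spdx_tag(path):
                continue
            untagged.append(path)
            n = line_count(path)
            if n in buckets:
                buckets[n] += 1
    return sorted(untagged), buckets
```

Calling `survey("/path/to/linux")` returns the untagged files plus the counts of very short ones. It says nothing about license hints; that part needs scancode.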
Files without any license hint:
  arch             774
  block              1
  certs              2
  crypto            10
  Documentation   4266
  drivers          320
  fs                26
  include          124
  init               0
  ipc                0
  kernel            14
  lib               26
  mm                 3
  net               15
  samples            7
  scripts           63
  security           8
  sound              9
  tools           1457
  usr                0
  virt               0
Files with one license hint:
  arch            1405
  block              0
  certs              1
  crypto             1
  Documentation     65
  drivers         4369
  fs               126
  include          356
  init               0
  ipc                1
  kernel            18
  lib               35
  mm                 4
  net               69
  samples           14
  scripts           26
  security           0
  sound             40
  tools            141
  usr                1
  virt               0
Files with two license hints:
  arch             731
  block              0
  certs              0
  crypto             3
  Documentation     13
  drivers         1114
  fs                66
  include          101
  init               0
  ipc                0
  kernel             0
  lib               54
  mm                 0
  net               91
  samples           39
  scripts            5
  security           1
  sound             14
  tools             35
  usr                0
  virt               0
Script-able files with reasonable effort:
  No hint:      6501  ~90% of no-hint files
  One hint:     5129  ~76% of one-hint files
  Two hints:     584  ~25% of two-hint files
  Total:       12213  ~75% of !SPDX'ed files
  Remaining:    4197  ~ 5% of total files
Scancode rules involved:   561
Scancode rules validated:  117
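As a cross-check, the shares can be recomputed from the raw counts. A minimal sketch in Python; the constant and function names are made up for illustration, and the figures are the ones listed above. The total number of files in the tree is not given, so the "of total files" shares are left out.

```python
# Counts copied from the statistics above.
UNTAGGED = 16410
HINTS = {"none": 7131, "one": 6673, "two": 2267, "more": 339}
SCRIPTABLE = {"none": 6501, "one": 5129, "two": 584}
REPORTED_SCRIPTABLE_TOTAL = 12213


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


def report():
    """Build one text line per bucket plus the non-scriptable remainder."""
    lines = []
    for key, count in HINTS.items():
        pct = share(count, UNTAGGED)
        lines.append(f"hints {key:>4}: {count:5d} {pct:5.1f}% of untagged")
    for key, count in SCRIPTABLE.items():
        pct = share(count, HINTS[key])
        lines.append(f"scriptable {key:>4}: {count:5d} {pct:5.1f}% of bucket")
    remaining = UNTAGGED - REPORTED_SCRIPTABLE_TOTAL
    lines.append(f"remaining: {remaining}")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

The four hint buckets add up to 16410 exactly. The three script-able rows add up to 12214, one more than the reported total of 12213; the sketch uses the reported total, which matches the remainder of 4197.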
My plan is to focus on the 'low hanging' fruit of reasonably easy
script-able files first.
For the files with zero hints that requires a few questions to be answered
upfront:
 1) What's the approach for files with obviously not copyright-able
    content:
    - Files which just include other file[s] (one or two lines)
    - Files which have just a more or less useful comment why they
      are otherwise empty (one to three lines)
    - Files which just contain a #define FOO and an include of
      another file to compile the included file with some other
      functionality (two or three lines)
 2) What's the approach for machine generated files:
    - Primarily kernel configuration files
 3) What's the approach for 'hidden' dot-files like .gitignore:
    Those files are just providing information to tools. The file format
    is defined by the tool (git, clang, coccinelle....) and the creative
    content is exactly zero...
 4) What's the approach for binary blobs or other files which cannot carry
    license information in the file itself?
    Which is related to the discussion in this thread:
    https://lore.kernel.org/all/20220516101901.475557433@linutronix.de
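To get a feeling for how many files fall under questions 1 and 3, the trivial cases can be sorted mechanically. The following is a rough sketch only: the category names, the `MAX_LINES` limit and the regular expressions are assumptions made for illustration, it handles C-style comments only, and it does not attempt to detect machine generated files or binary blobs (questions 2 and 4).

```python
import re

MAX_LINES = 3  # assumption: only files of up to three lines are "trivial"

INCLUDE_RE = re.compile(r'^#\s*include\s+[<"][^>"]+[>"]$')
DEFINE_RE = re.compile(r'^#\s*define\s+\w+(\s+.*)?$')


def classify(name, text):
    """Put a file without license hints into a coarse bucket."""
    if name.startswith("."):
        return "dot-file"
    if len(text.splitlines()) > MAX_LINES:
        return "other"
    # Drop C-style block and line comments before looking at the rest.
    code = re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)
    code = re.sub(r"//[^\n]*", "", code)
    lines = [line.strip() for line in code.splitlines() if line.strip()]
    if not lines:
        return "comment-only" if text.strip() else "empty"
    if all(INCLUDE_RE.match(line) for line in lines):
        return "include-only"
    has_include = any(INCLUDE_RE.match(line) for line in lines)
    if has_include and all(
        INCLUDE_RE.match(line) or DEFINE_RE.match(line) for line in lines
    ):
        return "define-and-include"
    return "other"
```

For example, `classify("foo.c", '#define FOO\n#include "bar.c"\n')` yields "define-and-include". Whether such files need an identifier at all remains the open question; the sketch only groups them.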
The other question for these files with zero hints is which license to
choose. Sure you can argue that all files w/o any hint fall under the
project license, but especially the Documentation directory is interesting
as it's not clear for all of the various content what the preferred and
assumed license should be. That needs some thought and clarification.
For the kernel code itself that's not a real question, but the tools
directory might need some care too.
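Whichever license is picked, the tag line itself is mechanical. A minimal sketch, assuming the comment styles described in the kernel's license rules documentation (`//` for C source and device tree files, `/* */` for headers and assembly, `#` for scripts, `..` for .rst); the helper names and the extension table are illustrative and incomplete, and the license expression is a parameter on purpose.

```python
import os

# Comment style per file extension; an illustrative subset only.
STYLES = {
    ".c": "// {tag}",
    ".h": "/* {tag} */",
    ".S": "/* {tag} */",
    ".dts": "// {tag}",
    ".dtsi": "// {tag}",
    ".rst": ".. {tag}",
    ".py": "# {tag}",
    ".sh": "# {tag}",
    ".pl": "# {tag}",
}


def spdx_line(path, expression):
    """Return the SPDX tag line in the comment style matching path."""
    ext = os.path.splitext(path)[1]
    style = STYLES.get(ext)
    if style is None:
        raise ValueError(f"no comment style known for {path!r}")
    return style.format(tag=f"SPDX-License-Identifier: {expression}")


def add_spdx(text, path, expression):
    """Insert the tag as first line, or as second line after a '#!' line."""
    lines = text.splitlines(keepends=True)
    pos = 1 if lines and lines[0].startswith("#!") else 0
    if pos and not lines[0].endswith("\n"):
        lines[0] += "\n"
    lines.insert(pos, spdx_line(path, expression) + "\n")
    return "".join(lines)
```

`spdx_line("foo.c", "GPL-2.0-only")` gives `// SPDX-License-Identifier: GPL-2.0-only`. The expression here is just a placeholder, not a statement about which license applies to any particular file.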
For the files which have a licensing hint in whatever form, I think
resuming the work where we left off, i.e. mainly reviewing per scancode
match rules based patterns, makes a lot of sense.
Based on my cursory validation of those patterns I'm confident that we can
reach a 95% coverage within a reasonable amount of time.
Finally here is another round of important questions:
#1 Is there still interest to get this done? The silence on this list
after the initial effort is deafening.
#2 Are there still enough interested and competent people on this list to
handle the legal questions?
#3 Was there any progress on the outstanding questions on this list where
discussion dried out almost 3 years ago?
I'm willing to pull the cart again, but if the interest and support stays
around zero, I surely have other things to do.
Thanks,
Thomas
* Re: SPDX in the kernel: State of the union
2022-05-17 23:31 SPDX in the kernel: State of the union Thomas Gleixner
@ 2022-05-18 13:42 ` Allison Randal
2022-05-20 15:37 ` Thomas Gleixner
0 siblings, 1 reply; 5+ messages in thread
From: Allison Randal @ 2022-05-18 13:42 UTC (permalink / raw)
To: Thomas Gleixner, linux-spdx
On 5/17/22 7:31 PM, Thomas Gleixner wrote:
>
> Finally here is another round of important questions:
>
> #1 Is there still interest to get this done? The silence on this list
> after the initial effort is deafening.
This list had 210 messages in 2021, and 64 so far in 2022, which may be
silence compared to LKML, but is reasonably respectable ongoing traffic
for a small cleanup project.
I'm still reviewing all patches as they flow through this list. I
haven't been actively marking them as reviewed-by me, but I would raise
any problems I saw, and I've seen others raising problems.
So, yes, there's still interest, and if you want to start generating
more patches, I'll happily contribute to the review process.
I actually thought you just ran out of easily scriptable fixes, but it's
nice to hear that there's still substantially more we can do with
scancode rules.
> #2 Are there still enough interested and competent people on this list to
> handle the legal questions?
I think so, yes. If we've lost some of our reviewers, we can recruit new
ones.
With the auto-generated patches, you will probably need to rate-limit
like you did in 2019, since the tools can generate patches far more
rapidly than the humans can review them.
> #3 Was there any progress on the outstanding questions on this list where
> discussion dried out almost 3 years ago?
ISTR reaching conclusions on all the questions before, but if there are
some lingering, we can revive them. And, you raised good new questions too.
> I'm willing to pull the cart again, but if the interest and support stays
> around zero, I surely have other things to do.
If you have the time and energy to do another burst, go for it. I don't
know that we'll ever get to 100%, but every file we clean up is helpful,
so it's worth continuing.
Allison
* Re: SPDX in the kernel: State of the union
2022-05-18 13:42 ` Allison Randal
@ 2022-05-20 15:37 ` Thomas Gleixner
2022-05-22 15:17 ` Allison Randal
0 siblings, 1 reply; 5+ messages in thread
From: Thomas Gleixner @ 2022-05-20 15:37 UTC (permalink / raw)
To: Allison Randal, linux-spdx
On Wed, May 18 2022 at 09:42, Allison Randal wrote:
> On 5/17/22 7:31 PM, Thomas Gleixner wrote:
> I actually thought you just ran out of easily scriptable fixes, but it's
> nice to hear that there's still substantially more we can do with
> scancode rules.
I ran out of cycles :)
> With the auto-generated patches, you will probably need to rate-limit
> like you did in 2019, since the tools can generate patches far more
> rapidly than the humans can review them.
Sure.
> If you have the time and energy to do another burst, go for it. I don't
> know that we'll ever get to 100%, but every file we clean up is helpful,
> so it's worth continuing.
I started to get some structure into this mess. For the first step I
excluded the Documentation directory, except for files in it which fit
match rules that also apply to source files. I'll tend to the
Documentation directory in a separate step.
Then I categorized the remaining match rules into the following:
  Nr  Category     Rules  Files affected
   1  GPLv2[+]       141            1607
   2  GPL unknown     84            1663
   3  MIT             28            3275
   4  GPLv2/MIT        2              36
   5  BSD             20             114
   6  GPL/BSD         32            1004
   7  ISC              4             343
   8  X11              1               3
   9  Other            9              50
  10  Unclear         63             916
  11  Unknown         78             321
  12  Nasty           16              48
  13  Bogus           21             861
  #1    Pretty clear GPLv2[or later] and LGPL matches.
  #2    The nasty 'under GPL' ones. Quite a few of them reference COPYING.
  #3-9  Pretty clear matches for MIT/BSD/ISC/X11/ZLIB and GPL combos of
        those.
  #10   The unclear (at least to me) ones.
  #11   Licenses the kernel does not have (yet) in the LICENSES
        directory, but some of them are not really clear to me.
  #12   GPL version 1 and version 3, reiserfs and some proprietary.
  #13   A set of bogosities in scancode which I need to discuss
        with Philippe.
I probably made some mistakes here and there, but that's what I have
now.
I've generated static HTML pages from the data, which are available
here:
https://tglx.de/~tglx/spdx/index.html
so you can get a taste of what is coming to you sooner rather than later.
The categories link to pages with rules and the rules to a per rule details
page. The latter has links to a Linux cross reference site in case you
want to look at the real thing instead of the 'normalized' match
patterns on the rule page.
My plan is to start with categories #1 and #3-9 and send out batches of
patches to the list.
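To size that first round, the rule and file counts of categories #1 and #3-9 can be summed from the table above. A small Python sketch; the data is copied from the table, the names are made up for illustration. A file which matches more than one rule may be counted more than once, so the file total is at best an upper bound.

```python
# (Nr, category, rules, files affected), copied from the table above.
CATEGORIES = [
    (1, "GPLv2[+]", 141, 1607),
    (2, "GPL unknown", 84, 1663),
    (3, "MIT", 28, 3275),
    (4, "GPLv2/MIT", 2, 36),
    (5, "BSD", 20, 114),
    (6, "GPL/BSD", 32, 1004),
    (7, "ISC", 4, 343),
    (8, "X11", 1, 3),
    (9, "Other", 9, 50),
    (10, "Unclear", 63, 916),
    (11, "Unknown", 78, 321),
    (12, "Nasty", 16, 48),
    (13, "Bogus", 21, 861),
]

# Categories #1 and #3-9 as named in the plan.
FIRST_ROUND = {1} | set(range(3, 10))


def totals(selected):
    """Sum rules and affected files over the selected category numbers."""
    rules = sum(r for nr, _, r, _ in CATEGORIES if nr in selected)
    files = sum(f for nr, _, _, f in CATEGORIES if nr in selected)
    return rules, files


if __name__ == "__main__":
    rules, files = totals(FIRST_ROUND)
    print(f"first round: {rules} rules, up to {files} files")
```

That comes to 237 rules touching up to 6432 files.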
Which size of batches and what rate do you folks prefer?
Thanks,
tglx
* Re: SPDX in the kernel: State of the union
2022-05-20 15:37 ` Thomas Gleixner
@ 2022-05-22 15:17 ` Allison Randal
2022-05-22 17:35 ` Thomas Gleixner
0 siblings, 1 reply; 5+ messages in thread
From: Allison Randal @ 2022-05-22 15:17 UTC (permalink / raw)
To: Thomas Gleixner, linux-spdx
On 5/20/22 11:37 AM, Thomas Gleixner wrote:
>
> I ran out of cycles :)
Nod, it happens. Really, it's a planet-wide phenomenon the past couple
of years. :)
> I started to get some structure into this mess.
[...]
> I've generated static HTML pages from the data, which are available
> here:
>
> https://tglx.de/~tglx/spdx/index.html
Makes sense, and a large number of them look like they'll be easy to
review and approve.
> Which size of batches and what rate do you folks prefer?
Looking back to 2019, you generally sent batches of 10-25 patches per
day, where each patch was one match rule. Seems reasonable to start
again there, and tune up or down as needed.
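For what it's worth, the rate limiting amounts to chunking the list of match rules, one patch per rule. A trivial sketch in Python; the function name and the rule identifiers are hypothetical, and nothing here sends mail.

```python
def daily_batches(rules, per_day):
    """Split a list of match rules into consecutive batches of per_day."""
    if per_day < 1:
        raise ValueError("per_day must be at least 1")
    return [rules[i:i + per_day] for i in range(0, len(rules), per_day)]


if __name__ == "__main__":
    # Hypothetical rule names; real ones would come from the scancode data.
    rules = [f"rule-{n:03d}" for n in range(60)]
    for day, batch in enumerate(daily_batches(rules, 25), start=1):
        print(f"day {day}: {len(batch)} patches")
```

With 60 rules and 25 per day that gives batches of 25, 25 and 10. Tuning up or down is a matter of changing `per_day`.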
Allison
* Re: SPDX in the kernel: State of the union
2022-05-22 15:17 ` Allison Randal
@ 2022-05-22 17:35 ` Thomas Gleixner
0 siblings, 0 replies; 5+ messages in thread
From: Thomas Gleixner @ 2022-05-22 17:35 UTC (permalink / raw)
To: Allison Randal, linux-spdx
On Sun, May 22 2022 at 11:17, Allison Randal wrote:
> On 5/20/22 11:37 AM, Thomas Gleixner wrote:
>> I've generated static HTML pages from the data, which are available
>> here:
>>
>> https://tglx.de/~tglx/spdx/index.html
>
> Makes sense, and a large number of them look like they'll be easy to
> review and approve.
I hope so.
>> Which size of batches and what rate do you folks prefer?
>
> Looking back to 2019, you generally sent batches of 10-25 patches per
> day, where each patch was one match rule. Seems reasonable to start
> again there, and tune up or down as needed.
Sounds like a plan.
Thanks,
Thomas