* Adding more information to the SBOM
@ 2022-09-14 14:16 Marta Rybczynska
  2022-09-14 14:56 ` Joshua Watt
  2022-09-15 12:16 ` Richard Purdie
  0 siblings, 2 replies; 11+ messages in thread
From: Marta Rybczynska @ 2022-09-14 14:16 UTC (permalink / raw)
  To: OE-core, openembedded-architecture, Joshua Watt

Dear all,
(cross-posting to oe-core and *-architecture)
In the last months, we have worked in Oniro on using the create-spdx
class for both IP compliance and security.

During this work, Alberto Pianon has found that some information is
missing from the SBOM and it does not contain enough for Software
Composition Analysis. The main missing point is the relation between
the actual upstream sources and the final binaries (create-spdx uses
composite sources).

Alberto has worked on how to obtain the missing data and now has a
POC. This POC provides full source-to-binary tracking of Yocto builds
through a couple of scripts (intended to be transformed into a new
bbclass at a later stage). The goal is to add the missing pieces of
information in order to get a "real" SBOM from Yocto, which should, at
a minimum:

- carefully describe what is found in a final image (i.e. binary files
and their dependencies), since that is what is actually distributed
and goes into the final product;
- describe how such binary files have been generated and where they
come from (i.e. upstream sources, including patches and other stuff
added from meta-layers); provenance is important for a number of
reasons related to IP Compliance and security.

The aim is to become able to:

- map binaries to their corresponding upstream source packages (and
not to the "internal" source packages created by recipes by combining
multiple upstream sources and patches)
- map binaries to the source files that have been actually used to
build them - which usually are a small subset of the whole source
package

With respect to IP compliance, this would allow to, among other things:

- get the real license text for each binary file, by getting the
license of the specific source files it has been generated from
(provided by Fossology, for instance), - and not the main license
stated in the corresponding recipe (which may be as confusing as
GPL-2.0-or-later & LGPL-2.1-or-later & BSD-3-Clause & BSD-4-Clause, or
even worse)
- automatically check license incompatibilities at the binary file level.

Other possible interesting things could be done also on the security side.

This work intends to add a way to provide additional data that can be
used by create-spdx, not to replace create-spdx in any way.

The sources with a long README are available at
https://gitlab.eclipse.org/eclipse/oniro-compliancetoolchain/toolchain/tinfoilhat/-/tree/srctracker/srctracker

What do you think of this work? Would it be of interest to integrate
into YP at some point? Shall we discuss this?

Marta and Alberto



* Re: Adding more information to the SBOM
  2022-09-14 14:16 Adding more information to the SBOM Marta Rybczynska
@ 2022-09-14 14:56 ` Joshua Watt
  2022-09-14 17:10   ` [OE-core] " Alberto Pianon
  2022-09-15  1:16   ` [Openembedded-architecture] " Mark Hatle
  2022-09-15 12:16 ` Richard Purdie
  1 sibling, 2 replies; 11+ messages in thread
From: Joshua Watt @ 2022-09-14 14:56 UTC (permalink / raw)
  To: Marta Rybczynska; +Cc: OE-core, openembedded-architecture

On Wed, Sep 14, 2022 at 9:16 AM Marta Rybczynska <rybczynska@gmail.com> wrote:
>
> Dear all,
> (cross-posting to oe-core and *-architecture)
> In the last months, we have worked in Oniro on using the create-spdx
> class for both IP compliance and security.
>
> During this work, Alberto Pianon has found that some information is
> missing from the SBOM and it does not contain enough for Software
> Composition Analysis. The main missing point is the relation between
> the actual upstream sources and the final binaries (create-spdx uses
> composite sources).

I believe we map the binaries to the source code from the -dbg
packages; is the premise that this is insufficient? Can you elaborate
more on why that is, I don't quite understand. The debug sources are
(basically) what we actually compiled (e.g. post-do_patch) to produce
the binary, and you can in turn follow these back to the upstream
sources with the downloadLocation property.

>
> Alberto has worked on how to obtain the missing data and now has a
> POC. This POC provides full source-to-binary tracking of Yocto builds
> through a couple of scripts (intended to be transformed into a new
> bbclass at a later stage). The goal is to add the missing pieces of
> information in order to get a "real" SBOM from Yocto, which should, at
> a minimum:

Please be a little careful with the wording; SBoMs have a lot of uses,
and many of them we can satisfy with what we currently generate; it
may not do the exact use case you are looking for, but that doesn't
mean it's not a "real" SBoM :)

>
> - carefully describe what is found in a final image (i.e. binary files
> and their dependencies), since that is what is actually distributed
> and goes into the final product;
> - describe how such binary files have been generated and where they
> come from (i.e. upstream sources, including patches and other stuff
> added from meta-layers); provenance is important for a number of
> reasons related to IP Compliance and security.
>
> The aim is to become able to:
>
> - map binaries to their corresponding upstream source packages (and
> not to the "internal" source packages created by recipes by combining
> multiple upstream sources and patches)
> - map binaries to the source files that have been actually used to
> build them - which usually are a small subset of the whole source
> package
>
> With respect to IP compliance, this would allow to, among other things:
>
> - get the real license text for each binary file, by getting the
> license of the specific source files it has been generated from
> (provided by Fossology, for instance), - and not the main license
> stated in the corresponding recipe (which may be as confusing as
> GPL-2.0-or-later & LGPL-2.1-or-later & BSD-3-Clause & BSD-4-Clause, or
> even worse)

IIUC this is the difference between the "Declared" license and the
"Concluded" license. You can report both, and I think
create-spdx.bbclass can currently do this with its rudimentary source
license scanning. You really do want both and it's a great way to make
sure that the "Declared" license (that is the license in the recipe)
reflects the reality of the source code.
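
As a concrete sketch of the two fields (SPDX 2.x field names; the values are
invented for illustration, expressed here as a Python dict rather than real
create-spdx output):

# Hypothetical SPDX 2.x package entry: "licenseDeclared" mirrors the
# recipe's LICENSE variable, while "licenseConcluded" is what scanning
# or review actually found for this unit.
package = {
    "name": "util-linux",
    "licenseDeclared": "GPL-2.0-or-later & LGPL-2.1-or-later & BSD-3-Clause & BSD-4-Clause",
    "licenseConcluded": "GPL-2.0-or-later",
}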

> - automatically check license incompatibilities at the binary file level.
>
> Other possible interesting things could be done also on the security side.
>
> This work intends to add a way to provide additional data that can be
> used by create-spdx, not to replace create-spdx in any way.
>
> The sources with a long README are available at
> https://gitlab.eclipse.org/eclipse/oniro-compliancetoolchain/toolchain/tinfoilhat/-/tree/srctracker/srctracker
>
> What do you think of this work? Would it be of interest to integrate
> into YP at some point? Shall we discuss this?

This seems promising as something that could potentially move into
core. I have a few points:
 - The extraction of the sources to a dedicated directory is something
that Richard has been toying around with for quite a while, and I
think it would greatly simplify that part of your process. I would
very much encourage you to look at the work he's done, and work on
that to get it pushed across the finish line as it's a really good
improvement that would benefit not just your source scanning.
 - I would encourage you to not wait to turn this into a bbclass
and/or library functions. You should be able to do this in a new
layer, and that would make it much clearer as to what the path to
being included in OE-core would look like. It also would (IMHO) be
nicer to the users :)

>
> Marta and Alberto



* Re: [OE-core] Adding more information to the SBOM
  2022-09-14 14:56 ` Joshua Watt
@ 2022-09-14 17:10   ` Alberto Pianon
  2022-09-14 20:52     ` Joshua Watt
  2022-09-15  1:16   ` [Openembedded-architecture] " Mark Hatle
  1 sibling, 1 reply; 11+ messages in thread
From: Alberto Pianon @ 2022-09-14 17:10 UTC (permalink / raw)
  To: Joshua Watt; +Cc: Marta Rybczynska, OE-core, openembedded-architecture

Hi Joshua,

nice to meet you!

I'm new to this list, and I've always approached Yocto just from the
"IP compliance side", so I may miss important pieces of information. 
That
is why Marta encouraged me and is helping me to ask community feedback.

On 2022-09-14 16:56, Joshua Watt wrote:
> On Wed, Sep 14, 2022 at 9:16 AM Marta Rybczynska <rybczynska@gmail.com> 
> wrote:
>> 
>> Dear all,
>> (cross-posting to oe-core and *-architecture)
>> In the last months, we have worked in Oniro on using the create-spdx
>> class for both IP compliance and security.
>> 
>> During this work, Alberto Pianon has found that some information is
>> missing from the SBOM and it does not contain enough for Software
>> Composition Analysis. The main missing point is the relation between
>> the actual upstream sources and the final binaries (create-spdx uses
>> composite sources).
> 
> I believe we map the binaries to the source code from the -dbg
> packages; is the premise that this is insufficient? Can you elaborate
> more on why that is, I don't quite understand. The debug sources are
> (basically) what we actually compiled (e.g. post-do_patch) to produce
> the binary, and you can in turn follow these back to the upstream
> sources with the downloadLocation property.

This was also my assumption at the beginning. But then I found that
there are recipes with multiple upstream sources, which may be
combined/mixed together in recipes' WORKDIR. For instance this one:

https://git.yoctoproject.org/meta-virtualization/tree/recipes-networking/cni/cni_git.bb

SRC_URI = "\
    git://github.com/containernetworking/cni.git;branch=main;name=cni;protocol=https \
    git://github.com/containernetworking/plugins.git;branch=release-1.1;destsuffix=${S}/src/github.com/containernetworking/plugins;name=plugins;protocol=https \
    git://github.com/flannel-io/cni-plugin;branch=main;name=flannel_plugin;protocol=https;destsuffix=${S}/src/github.com/containernetworking/plugins/plugins/meta/flannel \
    "

(The third source is unpacked in a subdir of the second one)

From here I discovered that we can't assume that the first non-local URI
is the downloadLocation for all source files, because it is not always
the case.
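
To illustrate the subtree-ownership idea in code, here is a minimal Python
sketch (hypothetical, not part of the POC) that attributes each unpacked file
to the SRC_URI entry owning the longest matching destination subtree:

# Illustrative destsuffix-subtree -> upstream URI map for the recipe above;
# a real implementation would derive this from the parsed SRC_URI entries.
subtree_owners = {
    "src/github.com/containernetworking/plugins/plugins/meta/flannel":
        "git://github.com/flannel-io/cni-plugin",
    "src/github.com/containernetworking/plugins":
        "git://github.com/containernetworking/plugins.git",
    "": "git://github.com/containernetworking/cni.git",  # fallback owner
}

def download_location(relpath):
    # Longest matching prefix wins, so files under the flannel subdir are
    # attributed to the third URI even though it nests inside the second.
    matches = [p for p in subtree_owners if relpath.startswith(p)]
    return subtree_owners[max(matches, key=len)]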

Moreover, in the context of our project we also needed to find the
upstream sources for local patches, scripts, etc. added by recipes
(i.e. the corresponding layers' repos).

> 
>> 
>> Alberto has worked on how to obtain the missing data and now has a
>> POC. This POC provides full source-to-binary tracking of Yocto builds
>> through a couple of scripts (intended to be transformed into a new
>> bbclass at a later stage). The goal is to add the missing pieces of
>> information in order to get a "real" SBOM from Yocto, which should, at
>> a minimum:
> 
> Please be a little careful with the wording; SBoMs have a lot of uses,
> and many of them we can satisfy with what we currently generate; it
> may not do the exact use case you are looking for, but that doesn't
> mean it's not a "real" SBoM :)

You are right, sorry! "real" is meant in the context of our project,
where we need to make our Fossology Audit Team work on "original"
upstream source packages/repos, for a number of reasons (the main being
that in the Oniro project we have a complex build matrix with a lot of
available target machines and quite a number of different overrides
depending on the machine, so when it comes to IP compliance we need to
aggregate and simplify, otherwise our IP auditors would die :) )

But since our Audit Team, unlike in a commercial project, works fully
in the open, other projects may benefit from this approach as well:
with fully reviewed file-level license data publicly available for
quite a number of upstream sources and Yocto layers, a complete
source-to-binary tracking system would enable any Yocto project to get
very detailed license information for its images, to automatically
detect license incompatibilities between linked binary files, etc.

> 
>> 
>> - carefully describe what is found in a final image (i.e. binary files
>> and their dependencies), since that is what is actually distributed
>> and goes into the final product;
>> - describe how such binary files have been generated and where they
>> come from (i.e. upstream sources, including patches and other stuff
>> added from meta-layers); provenance is important for a number of
>> reasons related to IP Compliance and security.
>> 
>> The aim is to become able to:
>> 
>> - map binaries to their corresponding upstream source packages (and
>> not to the "internal" source packages created by recipes by combining
>> multiple upstream sources and patches)
>> - map binaries to the source files that have been actually used to
>> build them - which usually are a small subset of the whole source
>> package
>> 
>> With respect to IP compliance, this would allow to, among other 
>> things:
>> 
>> - get the real license text for each binary file, by getting the
>> license of the specific source files it has been generated from
>> (provided by Fossology, for instance), - and not the main license
>> stated in the corresponding recipe (which may be as confusing as
>> GPL-2.0-or-later & LGPL-2.1-or-later & BSD-3-Clause & BSD-4-Clause, or
>> even worse)
> 
> IIUC this is the difference between the "Declared" license and the
> "Concluded" license. You can report both, and I think
> create-spdx.bbclass can currently do this with its rudimentary source
> license scanning. You really do want both and it's a great way to make
> sure that the "Declared" license (that is the license in the recipe)
> reflects the reality of the source code.
> 

The issue is with components like util-linux, which contains a lot of
sub-components subject to different licenses; util-linux recipe's
license is "GPL-2.0-or-later & LGPL-2.1-or-later & BSD-3-Clause &
BSD-4-Clause", but from such information one cannot tell if a particular
binary file generated from util-linux is subject to GPL, LGPL, or
BSD-3|4-clause.

Of course, being able to track upstream sources to binaries at file
level would be useless if one doesn't have file-level license
information; but since Scancode and Fossology (and our Audit Team) may
provide such information, such tracking may become super-useful, in our
opinion.
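
As a toy example of what file-level data would enable, here is a Python
sketch (all file names and license data invented) deriving a binary's
effective license from the source files compiled into it:

# Assumed inputs: per-file concluded licenses (e.g. from Fossology) and a
# binary -> contributing-source-files map (e.g. from debug source tracking).
file_licenses = {
    "libuuid/src/gen_uuid.c": "BSD-3-Clause",
    "libuuid/src/pack.c": "BSD-3-Clause",
}
binary_sources = {
    "libuuid.so.1.3.0": ["libuuid/src/gen_uuid.c", "libuuid/src/pack.c"],
}

def binary_license(binary):
    # Conjunction (SPDX "AND") of the licenses of the files actually built in.
    return " AND ".join(sorted({file_licenses[f] for f in binary_sources[binary]}))

print(binary_license("libuuid.so.1.3.0"))  # -> BSD-3-Clause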


>> - automatically check license incompatibilities at the binary file 
>> level.
>> 
>> Other possible interesting things could be done also on the security 
>> side.
>> 
>> This work intends to add a way to provide additional data that can be
>> used by create-spdx, not to replace create-spdx in any way.
>> 
>> The sources with a long README are available at
>> https://gitlab.eclipse.org/eclipse/oniro-compliancetoolchain/toolchain/tinfoilhat/-/tree/srctracker/srctracker
>> 
>> What do you think of this work? Would it be of interest to integrate
>> into YP at some point? Shall we discuss this?
> 
> This seems promising as something that could potentially move into
> core. I have a few points:
>  - The extraction of the sources to a dedicated directory is something
> that Richard has been toying around with for quite a while, and I
> think it would greatly simplify that part of your process. I would
> very much encourage you to look at the work he's done, and work on
> that to get it pushed across the finish line as it's a really good
> improvement that would benefit not just your source scanning.

Thanks for the suggestion, could you point me to Richard's work?
I'll surely look into it.

>  - I would encourage you to not wait to turn this into a bbclass
> and/or library functions. You should be able to do this in a new
> layer, and that would make it much clearer as to what the path to
> being included in OE-core would look like. It also would (IMHO) be
> nicer to the users :)

Understood :)

I'm the newbie here, so any other suggestion is warmly welcome.

Regards,

Alberto



* Re: [OE-core] Adding more information to the SBOM
  2022-09-14 17:10   ` [OE-core] " Alberto Pianon
@ 2022-09-14 20:52     ` Joshua Watt
  0 siblings, 0 replies; 11+ messages in thread
From: Joshua Watt @ 2022-09-14 20:52 UTC (permalink / raw)
  To: Alberto Pianon; +Cc: Marta Rybczynska, OE-core, openembedded-architecture

On Wed, Sep 14, 2022 at 12:10 PM Alberto Pianon <alberto@pianon.eu> wrote:
>
> Hi Joshua,
>
> nice to meet you!
>
> I'm new to this list, and I've always approached Yocto just from the
> "IP compliance side", so I may miss important pieces of information.
> That is why Marta encouraged me and is helping me to ask for community
> feedback.
>
> On 2022-09-14 16:56, Joshua Watt wrote:
> > On Wed, Sep 14, 2022 at 9:16 AM Marta Rybczynska <rybczynska@gmail.com>
> > wrote:
> >>
> >> Dear all,
> >> (cross-posting to oe-core and *-architecture)
> >> In the last months, we have worked in Oniro on using the create-spdx
> >> class for both IP compliance and security.
> >>
> >> During this work, Alberto Pianon has found that some information is
> >> missing from the SBOM and it does not contain enough for Software
> >> Composition Analysis. The main missing point is the relation between
> >> the actual upstream sources and the final binaries (create-spdx uses
> >> composite sources).
> >
> > I believe we map the binaries to the source code from the -dbg
> > packages; is the premise that this is insufficient? Can you elaborate
> > more on why that is, I don't quite understand. The debug sources are
> > (basically) what we actually compiled (e.g. post-do_patch) to produce
> > the binary, and you can in turn follow these back to the upstream
> > sources with the downloadLocation property.
>
> This was also my assumption at the beginning. But then I found that
> there are recipes with multiple upstream sources, which may be
> combined/mixed together in recipes' WORKDIR. For instance this one:
>
> https://git.yoctoproject.org/meta-virtualization/tree/recipes-networking/cni/cni_git.bb
>
> SRC_URI = "\
>     git://github.com/containernetworking/cni.git;branch=main;name=cni;protocol=https \
>     git://github.com/containernetworking/plugins.git;branch=release-1.1;destsuffix=${S}/src/github.com/containernetworking/plugins;name=plugins;protocol=https \
>     git://github.com/flannel-io/cni-plugin;branch=main;name=flannel_plugin;protocol=https;destsuffix=${S}/src/github.com/containernetworking/plugins/plugins/meta/flannel \
>     "
>
> (The third source is unpacked in a subdir of the second one)
>
>  From here I discovered that we can't assume that the first non-local URI
> is the downloadLocation for all source files, because it is not always
> the case.

This is true, but I think that's more of a problem with the inability
to express multiple download locations in the SPDX, not that we don't
have all the source when we generate the SPDX, correct? I _believe_
the -dbg package still contains all the source code from all three
URLs?

>
> Moreover, in the context of our project we also needed to find the
> upstream sources for local patches, scripts, etc. added by recipes
> (i.e. the corresponding layers' repos).

Ok, so this makes me wonder: If we implement the better source
extraction in OE core, does that help this problem? Is the primary
problem that you want the unpatched upstream source code files instead
of the patched ones, or is it some other problem?

AFAIK, the -dbg package contains the source code we actually
compiled... so I have a hard time understanding what's "incorrect"
(or not ideal) about referencing it; but I think I'm missing something
important :)

>
> >
> >>
> >> Alberto has worked on how to obtain the missing data and now has a
> >> POC. This POC provides full source-to-binary tracking of Yocto builds
> >> through a couple of scripts (intended to be transformed into a new
> >> bbclass at a later stage). The goal is to add the missing pieces of
> >> information in order to get a "real" SBOM from Yocto, which should, at
> >> a minimum:
> >
> > Please be a little careful with the wording; SBoMs have a lot of uses,
> > and many of them we can satisfy with what we currently generate; it
> > may not do the exact use case you are looking for, but that doesn't
> > mean it's not a "real" SBoM :)
>
> You are right, sorry! "real" is meant in the context of our project,
> where we need to make our Fossology Audit Team work on "original"
> upstream source packages/repos, for a number of reasons (the main being
> that in the Oniro project we have a complex build matrix with a lot of
> available target machines and quite a number of different overrides
> depending on the machine, so when it comes to IP compliance we need to
> aggregate and simplify, otherwise our IP auditors would die :) )
>
> But since our Audit Team, unlike in a commercial project, works fully
> in the open, other projects may benefit from this approach as well:
> with fully reviewed file-level license data publicly available for
> quite a number of upstream sources and Yocto layers, a complete
> source-to-binary tracking system would enable any Yocto project to get
> very detailed license information for its images, to automatically
> detect license incompatibilities between linked binary files, etc.

Ok, so let me see if I can follow what you want here:
 1) Your Audit Team scans some open source repository, and generates
some sort of license report for it
 2) You do a Yocto build that builds that repository
 3) You want to link the SBoM generated by Yocto back to the report
from the Audit Team; specifically, you want to be able to trace binaries
in the system back to the original source code from the Audit Team report?

Currently #3 is difficult because
 1) Yocto only reports one SRC_URI in the SBoM
 2) Binaries are tracked back to the patched source code (in the
-dbg packages), so the checksums may not match the original upstream
source code.
Any other reasons?

>
> >
> >>
> >> - carefully describe what is found in a final image (i.e. binary files
> >> and their dependencies), since that is what is actually distributed
> >> and goes into the final product;
> >> - describe how such binary files have been generated and where they
> >> come from (i.e. upstream sources, including patches and other stuff
> >> added from meta-layers); provenance is important for a number of
> >> reasons related to IP Compliance and security.
> >>
> >> The aim is to become able to:
> >>
> >> - map binaries to their corresponding upstream source packages (and
> >> not to the "internal" source packages created by recipes by combining
> >> multiple upstream sources and patches)
> >> - map binaries to the source files that have been actually used to
> >> build them - which usually are a small subset of the whole source
> >> package
> >>
> >> With respect to IP compliance, this would allow to, among other
> >> things:
> >>
> >> - get the real license text for each binary file, by getting the
> >> license of the specific source files it has been generated from
> >> (provided by Fossology, for instance), - and not the main license
> >> stated in the corresponding recipe (which may be as confusing as
> >> GPL-2.0-or-later & LGPL-2.1-or-later & BSD-3-Clause & BSD-4-Clause, or
> >> even worse)
> >
> > IIUC this is the difference between the "Declared" license and the
> > "Concluded" license. You can report both, and I think
> > create-spdx.bbclass can currently do this with its rudimentary source
> > license scanning. You really do want both and it's a great way to make
> > sure that the "Declared" license (that is the license in the recipe)
> > reflects the reality of the source code.
> >
>
> The issue is with components like util-linux, which contains a lot of
> sub-components subject to different licenses; util-linux recipe's
> license is "GPL-2.0-or-later & LGPL-2.1-or-later & BSD-3-Clause &
> BSD-4-Clause", but from such information one cannot tell if a particular
> binary file generated from util-linux is subject to GPL, LGPL, or
> BSD-3|4-clause.
>
> Of course, being able to track upstream sources to binaries at file
> level would be useless if one doesn't have file-level license
> information; but since Scancode and Fossology (and our Audit Team) may
> provide such information, such tracking may become super-useful, in our
> opinion.

We also implement (and report) some rudimentary license scanning in
Yocto, but we only look for "SPDX-License-Identifier" tags.
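
That tag scanning amounts to something like the following sketch
(illustrative only, not the actual create-spdx code):

import re

# Matches e.g. "// SPDX-License-Identifier: GPL-2.0-only" in a source file.
TAG = re.compile(r"SPDX-License-Identifier:\s*(.+)")

def scan_file(path, max_lines=20):
    # Only the first few lines are checked, since the tag conventionally
    # sits in the file header.
    with open(path, errors="ignore") as f:
        for _, line in zip(range(max_lines), f):
            m = TAG.search(line)
            if m:
                return m.group(1).strip()
    return None  # no tag; only the recipe's declared license is available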



>
>
> >> - automatically check license incompatibilities at the binary file
> >> level.
> >>
> >> Other possible interesting things could be done also on the security
> >> side.
> >>
> >> This work intends to add a way to provide additional data that can be
> >> used by create-spdx, not to replace create-spdx in any way.
> >>
> >> The sources with a long README are available at
> >> https://gitlab.eclipse.org/eclipse/oniro-compliancetoolchain/toolchain/tinfoilhat/-/tree/srctracker/srctracker
> >>
> >> What do you think of this work? Would it be of interest to integrate
> >> into YP at some point? Shall we discuss this?
> >
> > This seems promising as something that could potentially move into
> > core. I have a few points:
> >  - The extraction of the sources to a dedicated directory is something
> > that Richard has been toying around with for quite a while, and I
> > think it would greatly simplify that part of your process. I would
> > very much encourage you to look at the work he's done, and work on
> > that to get it pushed across the finish line as it's a really good
> > improvement that would benefit not just your source scanning.
>
> Thanks for the suggestion, could you point me to Richard's work?
> I'll surely look into it.
>
> >  - I would encourage you to not wait to turn this into a bbclass
> > and/or library functions. You should be able to do this in a new
> > layer, and that would make it much clearer as to what the path to
> > being included in OE-core would look like. It also would (IMHO) be
> > nicer to the users :)
>
> Understood :)
>
> I'm the newbie here, so any other suggestion is warmly welcome.
>
> Regards,
>
> Alberto



* Re: [Openembedded-architecture] Adding more information to the SBOM
  2022-09-14 14:56 ` Joshua Watt
  2022-09-14 17:10   ` [OE-core] " Alberto Pianon
@ 2022-09-15  1:16   ` Mark Hatle
  1 sibling, 0 replies; 11+ messages in thread
From: Mark Hatle @ 2022-09-15  1:16 UTC (permalink / raw)
  To: Joshua Watt, Marta Rybczynska; +Cc: OE-core, openembedded-architecture



On 9/14/22 9:56 AM, Joshua Watt wrote:
> On Wed, Sep 14, 2022 at 9:16 AM Marta Rybczynska <rybczynska@gmail.com> wrote:
>>
>> Dear all,
>> (cross-posting to oe-core and *-architecture)
>> In the last months, we have worked in Oniro on using the create-spdx
>> class for both IP compliance and security.
>>
>> During this work, Alberto Pianon has found that some information is
>> missing from the SBOM and it does not contain enough for Software
>> Composition Analysis. The main missing point is the relation between
>> the actual upstream sources and the final binaries (create-spdx uses
>> composite sources).
> 
> I believe we map the binaries to the source code from the -dbg
> packages; is the premise that this is insufficient? Can you elaborate
> more on why that is, I don't quite understand. The debug sources are
> (basically) what we actually compiled (e.g. post-do_patch) to produce
> the binary, and you can in turn follow these back to the upstream
> sources with the downloadLocation property.

When I last looked at this, it was critical that the analysis be:

binary -> patched & configured source (dbg package) -> how the sources were 
constructed.

As Joshua said above, I believe all of the information is present for this, as
you can tie the binary (through debug symbols) back to the debug package, and
the source of the debug package back to the sources that constructed it via
heuristics. (If you enable the git patch mechanism, it should even be possible
to use git blame to find exactly which upstreams constructed the patched
sources.)

For generated content, it's more difficult -- but for those items usually there 
is a header which indicates what generated the content so other heuristics can 
be used.
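
As an illustration of the debug-symbol link in that chain, here is a sketch
using pyelftools (an assumed third-party library, not what the build system
itself uses) to list the source files recorded in a binary's DWARF info:

from elftools.elf.elffile import ELFFile  # pyelftools

def dwarf_source_files(path):
    # Collect the primary source file of each DWARF compilation unit in a
    # binary (or its -dbg counterpart); these are the files that were
    # actually compiled into it.
    files = set()
    with open(path, "rb") as f:
        elf = ELFFile(f)
        if elf.has_dwarf_info():
            for cu in elf.get_dwarf_info().iter_CUs():
                name = cu.get_top_DIE().attributes.get("DW_AT_name")
                if name:
                    files.add(name.value.decode())
    return files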

>>
>> Alberto has worked on how to obtain the missing data and now has a
>> POC. This POC provides full source-to-binary tracking of Yocto builds
>> through a couple of scripts (intended to be transformed into a new
>> bbclass at a later stage). The goal is to add the missing pieces of
>> information in order to get a "real" SBOM from Yocto, which should, at
>> a minimum:
> 
> Please be a little careful with the wording; SBoMs have a lot of uses,
> and many of them we can satisfy with what we currently generate; it
> may not do the exact use case you are looking for, but that doesn't
> mean it's not a "real" SBoM :)
> 
>>
>> - carefully describe what is found in a final image (i.e. binary files
>> and their dependencies), since that is what is actually distributed
>> and goes into the final product;
>> - describe how such binary files have been generated and where they
>> come from (i.e. upstream sources, including patches and other stuff
>> added from meta-layers); provenance is important for a number of
>> reasons related to IP Compliance and security.

Full compliance will require binaries mapped to patched source to upstream
sources _AND_ the instructions (layer/recipe/configuration) used to build them.
But it's up to the local legal determination to figure out 'how far you really
need to go', vs. just "here are the layers I used to build my project".

>> The aim is to become able to:
>>
>> - map binaries to their corresponding upstream source packages (and
>> not to the "internal" source packages created by recipes by combining
>> multiple upstream sources and patches)
>> - map binaries to the source files that have been actually used to
>> build them - which usually are a small subset of the whole source
>> package
>>
>> With respect to IP compliance, this would allow to, among other things:
>>
>> - get the real license text for each binary file, by getting the
>> license of the specific source files it has been generated from
>> (provided by Fossology, for instance), - and not the main license
>> stated in the corresponding recipe (which may be as confusing as
>> GPL-2.0-or-later & LGPL-2.1-or-later & BSD-3-Clause & BSD-4-Clause, or
>> even worse)
> 
> IIUC this is the difference between the "Declared" license and the
> "Concluded" license. You can report both, and I think
> create-spdx.bbclass can currently do this with its rudimentary source
> license scanning. You really do want both and it's a great way to make
> sure that the "Declared" license (that is the license in the recipe)
> reflects the reality of the source code.

And the thing to keep in mind is that in a given package the "Declared" is
usually what a LICENSE file or header says.  But the "Concluded" has levels of
quality behind it.  The first level of quality is "Declared".  The next level
is automation (something like Fossology), the next level is human reviewed, and
the highest level is "lawyer reviewed".

So being able to inject SPDX information with Concluded values for evaluation 
and track the 'quality level' has always been something I wanted to do, but 
never had time.

At the time, my idea was a database (and/or bbappend) for each component that
would include pre-processed SPDX data for each recipe.  This data would run
through a validation step to show it actually matches the patched sources.  (If 
any file checksums do NOT match, then they would be flagged for follow up.)
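
That validation step might look roughly like this sketch (the data layout is
invented for illustration):

import hashlib, os

def validate_spdx_data(reviewed_checksums, srcdir):
    # 'reviewed_checksums' maps relative paths to the sha256 recorded when
    # the pre-processed SPDX data was produced; mismatches get flagged.
    flagged = []
    for relpath, expected in reviewed_checksums.items():
        path = os.path.join(srcdir, relpath)
        if not os.path.isfile(path):
            flagged.append(relpath)
            continue
        with open(path, "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() != expected:
                flagged.append(relpath)
    return flagged  # files needing manual follow-up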

>> - automatically check license incompatibilities at the binary file level.
>>
>> Other possible interesting things could be done also on the security side.
>>
>> This work intends to add a way to provide additional data that can be
>> used by create-spdx, not to replace create-spdx in any way.
>>
>> The sources with a long README are available at
>> https://gitlab.eclipse.org/eclipse/oniro-compliancetoolchain/toolchain/tinfoilhat/-/tree/srctracker/srctracker
>>
>> What do you think of this work? Would it be of interest to integrate
>> into YP at some point? Shall we discuss this?
> 
> This seems promising as something that could potentially move into
> core. I have a few points:
>   - The extraction of the sources to a dedicated directory is something
> that Richard has been toying around with for quite a while, and I
> think it would greatly simplify that part of your process. I would
> very much encourage you to look at the work he's done, and work on
> that to get it pushed across the finish line as it's a really good
> improvement that would benefit not just your source scanning.
>   - I would encourage you to not wait to turn this into a bbclass
> and/or library functions. You should be able to do this in a new
> layer, and that would make it much clearer as to what the path to
> being included in OE-core would look like. It also would (IMHO) be
> nicer to the users :)

Agreed, this looks useful.  The key is to start turning it into one or more
bbclasses now -- things that work with the Yocto Project process.  Don't try to
"post-process" and reconstruct sources.  Instead, inject steps that will run
your file checksums and build up your database as the sources are constructed
(i.e. in do_unpack, do_patch, etc.).
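
That per-stage capture could be as small as this sketch, hooked in as
do_unpack/do_patch postfuncs (the hook points and output layout are
assumptions, not an existing OE-core API):

import hashlib, json, os

def snapshot_checksums(rootdir, outfile):
    # Record a {relative path: sha256} snapshot of a source tree; calling
    # this after do_unpack and again after do_patch lets a later tool see
    # which files each stage introduced or modified.
    snap = {}
    for dirpath, _, filenames in os.walk(rootdir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                snap[os.path.relpath(path, rootdir)] = \
                    hashlib.sha256(f.read()).hexdigest()
    with open(outfile, "w") as f:
        json.dump(snap, f, indent=2, sort_keys=True)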

The key is, all of the information IS available.  It just may not be in the 
format you want.

--Mark

>>
>> Marta and Alberto



* Re: [Openembedded-architecture] Adding more information to the SBOM
  2022-09-14 14:16 Adding more information to the SBOM Marta Rybczynska
  2022-09-14 14:56 ` Joshua Watt
@ 2022-09-15 12:16 ` Richard Purdie
  2022-09-16 15:18   ` Alberto Pianon
  1 sibling, 1 reply; 11+ messages in thread
From: Richard Purdie @ 2022-09-15 12:16 UTC (permalink / raw)
  To: Marta Rybczynska, OE-core, openembedded-architecture, Joshua Watt

On Wed, 2022-09-14 at 16:16 +0200, Marta Rybczynska wrote:
> The sources with a long README are available at
> https://gitlab.eclipse.org/eclipse/oniro-compliancetoolchain/toolchain/tinfoilhat/-/tree/srctracker/srctracker
> 
> What do you think of this work? Would it be of interest to integrate
> into YP at some point? Shall we discuss this?

I had a look at this and was a bit puzzled by some of it.

I can see the issues you'd have if you want to separate the unpatched
source from the patches and know which files had patches applied as
that is hard to track. There would be significant overhead in trying
to process and store that information in the unpack/patch steps and the
archiver class does some of that already. It is messy, hard and doesn't
perform well. I'm reluctant to force everyone to do it as a result but
that can also result in multiple code paths and when you have that, the
result is that one breaks :(.

I also can see the issue with multiple sources in SRC_URI, although you
should be able to map those back if you assume subtrees are "owned" by
given SRC_URI entries. I suspect there may be a SPDX format limit in
documenting that piece?

Where I became puzzled is where you say "Information about debug
sources for each actual binary file is then taken from
tmp/pkgdata/<machine>/extended/*.json.zstd". This is the data we added
and use for the spdx class so you shouldn't need to reinvent that
piece. It should be the exact same data the spdx class uses.
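
For reference, decoding those files is straightforward; a sketch (the
zstandard module is a third-party assumption, and the JSON structure inside
is whatever the pkgdata code wrote, so treat the keys as version specific):

import json
import zstandard  # third-party zstd bindings

def read_extended_pkgdata(path):
    # e.g. path = "tmp/pkgdata/<machine>/extended/<pkg>.json.zstd"
    with open(path, "rb") as f:
        with zstandard.ZstdDecompressor().stream_reader(f) as reader:
            return json.load(reader)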

I was also puzzled about the difference between rpm and the other
package backends. The exact same files are packaged by all the package
backends so the checksums from do_package should be fine.


For the source issues above, it basically comes down to how much
"pain" we want to push onto all users for the sake of adding in this
data. Unfortunately it is data which many won't need or use and
different legal departments do have different requirements. Experience
with archiver.bbclass shows that multiple codepaths doing these things
is a nightmare to keep working, particularly for corner cases which do
interesting things with the code (externalsrc, gcc shared workdir, the
kernel and more).

Cheers,

Richard



* Re: [Openembedded-architecture] Adding more information to the SBOM
  2022-09-15 12:16 ` Richard Purdie
@ 2022-09-16 15:18   ` Alberto Pianon
  2022-09-16 15:49     ` Mark Hatle
  2022-09-16 16:08     ` Richard Purdie
  0 siblings, 2 replies; 11+ messages in thread
From: Alberto Pianon @ 2022-09-16 15:18 UTC (permalink / raw)
  To: Richard Purdie
  Cc: Marta Rybczynska, OE-core, openembedded-architecture,
	Joshua Watt, 'Carlo Piana',
	davide.ricci

Hi Richard,

thank you for your reply, you gave me very interesting cues to think
about. I'll reply in reverse/importance order

On 2022-09-15 14:16, Richard Purdie wrote:
> 
> For the source issues above, it basically comes down to how much
> "pain" we want to push onto all users for the sake of adding in this
> data. Unfortunately it is data which many won't need or use and
> different legal departments do have different requirements.

We didn't paint the overall picture sufficiently well, therefore our
requirements may come across as coming from a particularly pedantic
legal department; my fault :)

Oniro is not "yet another commercial Yocto project"; we are not a legal
department (even if we are experienced FLOSS lawyers and auditors, the
most prominent of whom is Carlo Piana -- cc'ed -- former general counsel
of FSFE and member of OSI Board).

Our rather ambitious goal is not limited to Oniro, and consists in doing
compliance in the open source way and both setting an example and
providing guidance and material for others to benefit from our effort.
Our work will therefore be shared (and possibly improved by others) not
only with Oniro-based projects but also with any Yocto project. Among
other things, the most relevant bit of work that we want to share is
**fully reviewed license information** and other legal metadata about a
whole bunch of open source components commonly used in Yocto projects.

To do that in a **scalable and fully automated way**, we need Yocto to
collect some information that is currently discarded (or simply not
collected) at build time.

The Oniro Project Leader, Davide Ricci (cc'ed), strongly encouraged us
to seek feedback from you in order to find out the best way to do it.

Maybe organizing a call would be more convenient than discussing
background and requirements here, if you (and others) are available.


> Experience
> with archiver.bbclass shows that multiple codepaths doing these things
> is a nightmare to keep working, particularly for corner cases which do
> interesting things with the code (externalsrc, gcc shared workdir, the
> kernel and more).
> 
> I had a look at this and was a bit puzzled by some of it.
> 
> I can see the issues you'd have if you want to separate the unpatched
> source from the patches and know which files had patches applied as
> that is hard to track. There would be significant overhead in trying
> to process and store that information in the unpack/patch steps and the
> archiver class does some of that already. It is messy, hard and doesn't
> perform well. I'm reluctant to force everyone to do it as a result but
> that can also result in multiple code paths and when you have that, the
> result is that one breaks :(.
> 
> I also can see the issue with multiple sources in SRC_URI, although you
> should be able to map those back if you assume subtrees are "owned" by
> given SRC_URI entries. I suspect there may be a SPDX format limit in
> documenting that piece?

I'm replying in reverse order:

- there is a SPDX format limit, but it is by design: a SPDX package
   entity is a single sw distribution unit, so it may have only one
   downloadLocation; if you have more than one downloadLocation, you must
   have more than one SPDX package, according to SPDX specs;

- I understand that my solution is a bit hacky; but IMHO any other
   *post-mortem* solution would be far more hacky; the real solution
   would be collecting required information directly in do_fetch and
   do_unpack

- I also understand that we should reduce pain, otherwise nobody would
   use our solution; the simplest and cleanest way I can think about is
   collecting just package (in the SPDX sense) files' relative paths and
   checksums at every stage (fetch, unpack, patch, package), and leave
   data processing (i.e. mapping upstream source packages -> recipe's
   WORKDIR package -> debug source package -> binary packages -> binary
   image) to a separate tool, that may use (just a thought) a graph
   database to process things more efficiently.
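
As a toy illustration of that separate processing tool, assuming the
per-stage snapshots described above as input (the graph database is left
out; a plain dict shows the principle):

def trace_provenance(stages):
    # 'stages' is an ordered list of (stage_name, {relpath: checksum})
    # pairs, e.g. fetch -> unpack -> patch -> package. Content is
    # attributed to the earliest stage where its checksum appears.
    provenance = {}
    for stage_name, snapshot in stages:
        for relpath, csum in snapshot.items():
            provenance.setdefault(csum, (stage_name, relpath))
    return provenance  # checksum -> (first stage seen, path there)

Looking up a packaged file's checksum in that map tells you where its
content first appeared unchanged.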


> 
> Where I became puzzled is where you say "Information about debug
> sources for each actual binary file is then taken from
> tmp/pkgdata/<machine>/extended/*.json.zstd". This is the data we added
> and use for the spdx class so you shouldn't need to reinvent that
> piece. It should be the exact same data the spdx class uses.
> 

you're right, but in the context of a POC it was easier to extract them
directly from json files than from SPDX data :) It's just a POC to show
that required information may be retrieved in some way; implementation
details do not matter.

> I was also puzzled about the difference between rpm and the other
> package backends. The exact same files are packaged by all the package
> backends so the checksums from do_package should be fine.
> 

Here I may miss some piece of information. I looked at files in
tmp/pkgdata but I couldn't find package file checksums anywhere: that is
why I parsed rpm packages. But if such checksums were already available
somewhere in tmp/pkgdata, it wouldn't be necessary to parse rpm packages
at all... Could you point me to what I'm (maybe) missing here? Thanks!

In any case, thank you so much for all your insights, they were
super-useful!

Cheers,

Alberto



* Re: [Openembedded-architecture] Adding more information to the SBOM
  2022-09-16 15:18   ` Alberto Pianon
@ 2022-09-16 15:49     ` Mark Hatle
  2022-09-20 12:25       ` Alberto Pianon
  2022-09-16 16:08     ` Richard Purdie
  1 sibling, 1 reply; 11+ messages in thread
From: Mark Hatle @ 2022-09-16 15:49 UTC (permalink / raw)
  To: Alberto Pianon, Richard Purdie
  Cc: Marta Rybczynska, OE-core, openembedded-architecture,
	Joshua Watt, 'Carlo Piana',
	davide.ricci



On 9/16/22 10:18 AM, Alberto Pianon wrote:

... trimmed ...

>> I also can see the issue with multiple sources in SRC_URI, although you
>> should be able to map those back if you assume subtrees are "owned" by
>> given SRC_URI entries. I suspect there may be a SPDX format limit in
>> documenting that piece?
> 
> I'm replying in reverse order:
> 
> - there is a SPDX format limit, but it is by design: a SPDX package
>     entity is a single sw distribution unit, so it may have only one
>     downloadLocation; if you have more than one downloadLocation, you must
>     have more than one SPDX package, according to SPDX specs;

I think my interpretation of this is different.  I've got a view of 'sourcing
materials', and then verifying they are what we think they are and can be used
the way we want.  The "upstream sources" (and patches) are really just 'raw
materials' that the Yocto Project combines to create "the source".

So for the purpose of the SPDX, each upstream source _may_ have a corresponding
SPDX, but for the binaries their source is the combined unit, not multiple
SPDXes.  Think of it something like:

upstream source1 - SPDX
upstream source2 - SPDX
upstream patch
recipe patch1
recipe patch2

In the above, each of those items would be combined by the recipe system to 
construct the source used to build an individual recipe (and collection of 
packages).  Automation _IS_ used to combine the components [unpack/fetch] and
_MAY_ be used to generate a combined SPDX.

So your "upstream" location for this recipe is the local machine's source 
archive.  The SPDX for the local recipe files can merge the SPDX information 
they know (and if it's at a file level) can use checksums to identify the items 
not captured/modified by the patches for further review (either manual or 
automation like fossology).  In the case where an upstream has SPDX data, you 
should be able to inherit MOST files this way... but the output is specific to 
your configuration and patches.

1 - SPDX |
2 - SPDX |
patch    |---> recipe specific SPDX
patch    |
patch    |

In some cases someone may want to generate SPDX data for the 3 patches, but that 
may or may not be useful in this context.

> - I understand that my solution is a bit hacky; but IMHO any other
>     *post-mortem* solution would be far more hacky; the real solution
>     would be collecting required information directly in do_fetch and
>     do_unpack

I've not looked at the current SPDX spec, but past versions had a notes section.
Assuming this is still present, you can use it to reference back to how this
component was constructed and the upstream source URIs (and SPDX files) you used
for processing.

This way nothing really changes in do_fetch or do_unpack.  (You may want to find
a way to capture file checksums and what the source was for a particular file,
but it may not really be necessary!)

> - I also understand that we should reduce pain, otherwise nobody would
>     use our solution; the simplest and cleanest way I can think about is
>     collecting just package (in the SPDX sense) files' relative paths and
>     checksums at every stage (fetch, unpack, patch, package), and leave
>     data processing (i.e. mapping upstream source packages -> recipe's
>     WORKDIR package -> debug source package -> binary packages -> binary
>     image) to a separate tool, that may use (just a thought) a graph
>     database to process things more efficiently.

Even in do_patch nothing really changes, other than that, again, you may want
to capture checksums to identify things that need further processing.


This approach greatly simplifies things, and gives people doing code reviews
insight into the source used when shipping the binaries (which is really
an important aspect of this), as well as which recipe and "build" (really
fetch/unpack/patch) were used to construct the sources.  If they want to
investigate the sources further back to their provider, then the notes would
have the information for that, and you could transition back to the "raw
materials" providers.

>>
>> Where I became puzzled is where you say "Information about debug
>> sources for each actual binary file is then taken from
>> tmp/pkgdata/<machine>/extended/*.json.zstd". This is the data we added
>> and use for the spdx class so you shouldn't need to reinvent that
>> piece. It should be the exact same data the spdx class uses.
>>
> 
> you're right, but in the context of a POC it was easier to extract them
> directly from json files than from SPDX data :) It's just a POC to show
> that required information may be retrieved in some way; implementation
> details do not matter.
> 
>> I was also puzzled about the difference between rpm and the other
>> package backends. The exact same files are packaged by all the package
>> backends so the checksums from do_package should be fine.
>>
> 
> Here I may miss some piece of information. I looked at files in
> tmp/pkgdata but I couldn't find package file checksums anywhere: that is
> why I parsed rpm packages. But if such checksums were already available
> somewhere in tmp/pkgdata, it wouldn't be necessary to parse rpm packages
> at all... Could you point me to what I'm (maybe) missing here? Thanks!

File checksumming is expensive.  There are checksums available to individual
packaging engines, as well as aggregate checksums for "hash equivalency", but
I'm not aware of any per-file checksum that is stored.

You definitely shouldn't be parsing packages of any type (rpm or otherwise), as 
packages are truly optional.  It's the binaries that matter here.

--Mark

> In any case, thank you so much for all your insights, they were
> super-useful!
> 
> Cheers,
> 
> Alberto



* Re: [Openembedded-architecture] Adding more information to the SBOM
  2022-09-16 15:18   ` Alberto Pianon
  2022-09-16 15:49     ` Mark Hatle
@ 2022-09-16 16:08     ` Richard Purdie
       [not found]       ` <1061592967.5114533.1663597215958.JavaMail.zimbra@piana.eu>
  1 sibling, 1 reply; 11+ messages in thread
From: Richard Purdie @ 2022-09-16 16:08 UTC (permalink / raw)
  To: Alberto Pianon
  Cc: Marta Rybczynska, OE-core, openembedded-architecture,
	Joshua Watt, 'Carlo Piana',
	davide.ricci

On Fri, 2022-09-16 at 17:18 +0200, Alberto Pianon wrote:
> On 2022-09-15 14:16, Richard Purdie wrote:
> > 
> > For the source issues above, it basically comes down to how much
> > "pain" we want to push onto all users for the sake of adding in this
> > data. Unfortunately it is data which many won't need or use and
> > different legal departments do have different requirements.
> 
> We didn't paint the overall picture sufficiently well, therefore our
> requirements may come across as coming from a particularly pedantic
> legal department; my fault :)
> 
> Oniro is not "yet another commercial Yocto project"; we are not a legal
> department (even if we are experienced FLOSS lawyers and auditors, the
> most prominent of whom is Carlo Piana -- cc'ed -- former general counsel
> of FSFE and member of OSI Board).
> 
> Our rather ambitious goal is not limited to Oniro, and consists in doing
> compliance in the open source way and both setting an example and
> providing guidance and material for others to benefit from our effort.
> Our work will therefore be shared (and possibly improved by others) not
> only with Oniro-based projects but also with any Yocto project. Among
> other things, the most relevant bit of work that we want to share is
> **fully reviewed license information** and other legal metadata about a
> whole bunch of open source components commonly used in Yocto projects.

I certainly love the goal. I presume you're going to share your review
criteria somehow? There must be some further set of steps,
documentation and results beyond what we're discussing here?

I think the challenge will be whether you can publish that review with
sufficient "proof" that other legal departments can leverage it. I
wouldn't underestimate how different the requirements and process can
be between different people/teams/companies.

> To do that in a **scalable and fully automated way**, we need Yocto to
> collect some information that is currently discarded (or simply not
> collected) at build time.
> 
> The Oniro Project Leader, Davide Ricci (cc'ed), strongly encouraged us
> to seek feedback from you in order to find out the best way to do it.
> 
> Maybe organizing a call would be more convenient than discussing
> background and requirements here, if you (and others) are available.

I don't mind having a call but the discussion in this current form may
have an important element we shouldn't overlook, which is that it isn't
just me you need to convince on some of this.

If, for example, we should radically change the unpack/patch process,
we need to have a good explanation for why people need to take that
build time/space/resource hit. If we conclude that on a call, the case
to the wider community would still have to be made.

> > Experience
> > with archiver.bbclass shows that multiple codepaths doing these things
> > is a nightmare to keep working, particularly for corner cases which do
> > interesting things with the code (externalsrc, gcc shared workdir, the
> > kernel and more).
> > 
> > I had a look at this and was a bit puzzled by some of it.
> > 
> > I can see the issues you'd have if you want to separate the unpatched
> > source from the patches and know which files had patches applied as
> > that is hard to track. There would be significant overhead in trying
> > to process and store that information in the unpack/patch steps and the
> > archiver class does some of that already. It is messy, hard and doesn't
> > perform well. I'm reluctant to force everyone to do it as a result but
> > that can also result in multiple code paths and when you have that, the
> > result is that one breaks :(.
> > 
> > I also can see the issue with multiple sources in SRC_URI, although you
> > should be able to map those back if you assume subtrees are "owned" by
> > given SRC_URI entries. I suspect there may be a SPDX format limit in
> > documenting that piece?
> 
> I'm replying in reverse order:
> 
> - there is a SPDX format limit, but it is by design: a SPDX package
>    entity is a single sw distribution unit, so it may have only one
>    downloadLocation; if you have more than one downloadLocation, you must
>    have more than one SPDX package, according to SPDX specs;

I think we may need to talk to the SPDX people about that as I'm not
convinced it always holds that you can divide software into such units.
Certainly you can construct a situation where there are two
repositories, each containing a source file, where the two files are
only ever linked together as one binary.

> - I understand that my solution is a bit hacky; but IMHO any other
>    *post-mortem* solution would be far more hacky; the real solution
>    would be collecting required information directly in do_fetch and
>    do_unpack

Agreed, this needs to be done at unpack/patch time. Don't underestimate
the impact of this on general users though as many won't appreciate
slowing down their builds generating this information :/.

There is also a pile of information some legal departments want which
you've not mentioned here, such as build scripts and configuration
information. Some previous discussions with other parts of the wider
open source community rejected Yocto Projects efforts as insufficient
since we didn't mandate and capture all of this too (the archiver could
optionally do some of it iirc). Is this just the first step and we're
going to continue dumping more data? Or is this sufficient and all any
legal department should need?

> - I also understand that we should reduce pain, otherwise nobody would
>    use our solution; the simplest and cleanest way I can think about is
>    collecting just package (in the SPDX sense) files' relative paths and
>    checksums at every stage (fetch, unpack, patch, package), and leave
>    data processing (i.e. mapping upstream source packages -> recipe's
>    WORKDIR package -> debug source package -> binary packages -> binary
>    image) to a separate tool, that may use (just a thought) a graph
>    database to process things more efficiently.

I'd suggest stepping back and working out whether the SPDX requirement
of a "single download location", from which some of this stems, really
makes sense.

> > Where I became puzzled is where you say "Information about debug
> > sources for each actual binary file is then taken from
> > tmp/pkgdata/<machine>/extended/*.json.zstd". This is the data we added
> > and use for the spdx class so you shouldn't need to reinvent that
> > piece. It should be the exact same data the spdx class uses.
> > 
> 
> you're right, but in the context of a POC it was easier to extract them
> directly from json files than from SPDX data :) It's just a POC to show
> that required information may be retrieved in some way; implementation
> details do not matter

Fair enough, I just want to be clear we don't want to duplicate this.

> 
> > I was also puzzled about the difference between rpm and the other
> > package backends. The exact same files are packaged by all the package
> > backends so the checksums from do_package should be fine.
> > 
> 
> Here I may be missing some piece of information. I looked at files in
> tmp/pkgdata but I couldn't find package file checksums anywhere: that is
> why I parsed rpm packages. But if such checksums were already available
> somewhere in tmp/pkgdata, it wouldn't be necessary to parse rpm packages
> at all... Could you point me to what I'm (maybe) missing here? Thanks!

In some ways this is quite simple: at do_package time, the output
packages don't exist, only their content. The final output packages
are generated in do_package_write_{ipk|deb|rpm}.

You'd probably have to add a stage to the package_write tasks which
wrote out more checksum data since the checksums are only known at the
end of those tasks. I would question whether adding this additional
checksum into the SPDX output actually helps much in the real world
though. I guess it means you could look an RPM up against its checksum
but is that something people need to do?
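
If it did turn out to be needed, the natural shape would be a postfunc
on the write task, something like this untested sketch (PKGWRITEDIRRPM
is from memory and should be treated as an assumption):

    python do_package_write_rpm_checksums() {
        import hashlib, json, os
        sums = {}
        for root, dirs, files in os.walk(d.getVar("PKGWRITEDIRRPM")):
            for name in files:
                if not name.endswith(".rpm"):
                    continue
                with open(os.path.join(root, name), "rb") as f:
                    sums[name] = hashlib.sha256(f.read()).hexdigest()
        outpath = os.path.join(d.getVar("WORKDIR"), "rpm-checksums.json")
        with open(outpath, "w") as f:
            json.dump(sums, f, indent=2)
    }
    do_package_write_rpm[postfuncs] += "do_package_write_rpm_checksums"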

Cheers,

Richard






^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Openembedded-architecture] Adding more information to the SBOM
  2022-09-16 15:49     ` Mark Hatle
@ 2022-09-20 12:25       ` Alberto Pianon
  0 siblings, 0 replies; 11+ messages in thread
From: Alberto Pianon @ 2022-09-20 12:25 UTC (permalink / raw)
  To: Mark Hatle
  Cc: Richard Purdie, Marta Rybczynska, OE-core,
	openembedded-architecture, Joshua Watt, 'Carlo Piana',
	davide.ricci


On 2022-09-16 17:49, Mark Hatle wrote:
> On 9/16/22 10:18 AM, Alberto Pianon wrote:
> 
> ... trimmed ...
> 
>>> I also can see the issue with multiple sources in SRC_URI, although you
>>> should be able to map those back if you assume subtrees are "owned" by
>>> given SRC_URI entries. I suspect there may be a SPDX format limit in
>>> documenting that piece?
>> 
>> I'm replying in reverse order:
>> 
>> - there is a SPDX format limit, but it is by design: a SPDX package
>>     entity is a single sw distribution unit, so it may have only one
>>     downloadLocation; if you have more than one downloadLocation, you must
>>     have more than one SPDX package, according to SPDX specs;
> 
> I think my interpretation of this is different.  I've got a view of
> 'sourcing materials', and then verifying they are what we think they
> are and can be used the way we want.  The "upstream sources" (and
> patches) are really just 'raw materials' that we use the Yocto Project
> to combine to create "the source".
> 
> So for the purpose of the SPDX, each upstream source _may_ have a
> corresponding SPDX, but for the binaries their source is the combined
> unit... not multiple SPDXes.  Think of it as something like:
> 
> upstream source1 - SPDX
> upstream source2 - SPDX
> upstream patch
> recipe patch1
> recipe patch2
> 
> In the above, each of those items would be combined by the recipe
> system to construct the source used to build an individual recipe (and
> collection of packages).  Automation _IS_ used to combine the
> components [unpack/fetch] and _MAY_ be used to generate a combined
> SPDX.
> 
> So your "upstream" location for this recipe is the local machine's
> source archive.  The SPDX for the local recipe files can merge the
> SPDX information they know and, if it's at a file level, can use
> checksums to identify the items not captured/modified by the patches
> for further review (either manual or automation like fossology).  In
> the case where an upstream has SPDX data, you should be able to
> inherit MOST files this way... but the output is specific to your
> configuration and patches.
> 
> 1 - SPDX |
> 2 - SPDX |
> patch    |---> recipe specific SPDX
> patch    |
> patch    |
> 
> In some cases someone may want to generate SPDX data for the 3
> patches, but that may or may not be useful in this context.

IMHO it's a matter of different ways of framing Yocto recipes into SPDX
format.

Upstream sources are all SPDX packages. Yocto layers are SPDX packages,
too, containing patches that are PATCH_FOR upstream packages.

Upstream sources and Yocto layers are the "final" upstream sources, and
each of them has its downloadLocation.

"The source" created by a recipe is another SPDX package, GENERATED_FROM
upstream source packages + recipe and patches from Yocto layer
package(s). "The source" may need to be distributed by downstream users
(e.g. to comply with *GPL-* obligations or when providing SDKs), so
downstream users may make it available from their own infrastructure,
"giving" it a downloadLocation.

(in SPDX, GENERATED_FROM and PATCH_FOR relationships may be between
files, so one may map files found in "the source" package to individual
files found in upstream source packages)

Binary packages GENERATED_FROM "the source" are local SPDX packages,
too. And firmware images are SPDX packages, too, GENERATED_FROM all the
above. Firmware images are distributed by downstream users, who will
provide their own downloadLocation.
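
Roughly, the chain in SPDX relationship terms would be (sketch only,
identifiers invented):

    {"spdxElementId": "SPDXRef-layer-patch",
     "relationshipType": "PATCH_FOR",
     "relatedSpdxElement": "SPDXRef-upstream-src"},
    {"spdxElementId": "SPDXRef-the-source",
     "relationshipType": "GENERATED_FROM",
     "relatedSpdxElement": "SPDXRef-upstream-src"},
    {"spdxElementId": "SPDXRef-binary-package",
     "relationshipType": "GENERATED_FROM",
     "relatedSpdxElement": "SPDXRef-the-source"},
    {"spdxElementId": "SPDXRef-firmware-image",
     "relationshipType": "GENERATED_FROM",
     "relatedSpdxElement": "SPDXRef-binary-package"}

Each package in the chain keeps its own downloadLocation, which is the
point of this framing.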

> 
>> - I understand that my solution is a bit hacky; but IMHO any other
>>     *post-mortem* solution would be far more hacky; the real solution
>>     would be collecting required information directly in do_fetch and
>>     do_unpack
> 
> I've not looked at the current SPDX spec, but past versions had a
> notes section.  Assuming this is still present, you can use it to
> reference back to how this component was constructed and the upstream
> source URIs (and SPDX files) you used for processing.
> 
> This way nothing really changes in do_fetch or do_unpack.  (You may
> want to find a way to capture file checksums and what the source was
> for a particular file... but it may not really be necessary!)
> 

If you want to automatically map all files to their corresponding
upstream sources, it actually is necessary... see my next point


>> - I also understand that we should reduce pain, otherwise nobody would
>>     use our solution; the simplest and cleanest way I can think of is
>>     collecting just package (in the SPDX sense) files' relative paths and
>>     checksums at every stage (fetch, unpack, patch, package), and leave
>>     data processing (i.e. mapping upstream source packages -> recipe's
>>     WORKDIR package -> debug source package -> binary packages -> binary
>>     image) to a separate tool, that may use (just a thought) a graph
>>     database to process things more efficiently.
> 
> Even in do_patch nothing really changes, other than that, again, you may
> want to capture checksums to identify things that need further processing.
> 
> 
> This approach greatly simplifies things, and gives people doing code
> reviews insight into what source was used when shipping the
> binaries (which is really an important aspect of this), as well as
> which recipe and "build" (really fetch/unpack/patch) were used to
> construct the sources.  If they want to investigate the sources
> further back to their provider, then the notes would have the
> information for that, and you could transition back to the "raw
> materials" providers.

The point is precisely that we would like to help people avoid doing
this job, because if you scale up to n different Yocto projects it would
be a time-consuming, error-prone and hardly maintainable process. Since
SPDX allows representing relationships between any kind of entities
(files, packages), we would like to use that feature to map local source
files to upstream source files, so machines may do the job instead of
people -- and people (auditors) may concentrate on reviewing upstream
sources -- i.e. the atomic ingredients used across different projects or
across different versions of the same project.
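
As a sketch of what "machines doing the job" could mean, it is mostly a
join on checksums between per-stage manifests (manifest formats
invented here, untested):

    import json

    def map_to_upstream(local_manifest, upstream_manifests):
        # local_manifest: {path: sha256} for "the source";
        # upstream_manifests: {package: {path: sha256}} per upstream.
        by_checksum = {}
        for pkg, files in upstream_manifests.items():
            for path, csum in files.items():
                by_checksum.setdefault(csum, []).append((pkg, path))
        mapping = {}
        for path, csum in local_manifest.items():
            # no match means the file was added or changed by patches
            mapping[path] = by_checksum.get(csum, [])
        return mapping

Files with no upstream match are exactly the ones an auditor still has
to look at by hand; everything else inherits the review already done on
its upstream source.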


> 
>>> 
>>> Where I became puzzled is where you say "Information about debug
>>> sources for each actual binary file is then taken from
>>> tmp/pkgdata/<machine>/extended/*.json.zstd". This is the data we added
>>> and use for the spdx class so you shouldn't need to reinvent that
>>> piece. It should be the exact same data the spdx class uses.
>>> 
>> 
>> you're right, but in the context of a POC it was easier to extract them
>> directly from json files than from SPDX data :) It's just a POC to show
>> that required information may be retrieved in some way; implementation
>> details do not matter
>> 
>>> I was also puzzled about the difference between rpm and the other
>>> package backends. The exact same files are packaged by all the package
>>> backends so the checksums from do_package should be fine.
>>> 
>> 
>> Here I may be missing some piece of information. I looked at files in
>> tmp/pkgdata but I couldn't find package file checksums anywhere: that is
>> why I parsed rpm packages. But if such checksums were already available
>> somewhere in tmp/pkgdata, it wouldn't be necessary to parse rpm packages
>> at all... Could you point me to what I'm (maybe) missing here? Thanks!
> 
> File checksumming is expensive.  There are checksums available to
> individual packaging engines, as well as aggregate checksums for "hash
> equivalence"... but I'm not aware of any per-file checksum that is
> stored.
> 
> You definitely shouldn't be parsing packages of any type (rpm or
> otherwise), as packages are truly optional.  It's the binaries that
> matter here.

You are definitely right. I guess that it should be done (optionally) in
do_package.



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Openembedded-architecture] Adding more information to the SBOM
       [not found]       ` <1061592967.5114533.1663597215958.JavaMail.zimbra@piana.eu>
@ 2022-09-20 13:15         ` Richard Purdie
  0 siblings, 0 replies; 11+ messages in thread
From: Richard Purdie @ 2022-09-20 13:15 UTC (permalink / raw)
  To: Carlo Piana
  Cc: Alberto Pianon, Marta Rybczynska, OE-core,
	openembedded-architecture, Joshua Watt, davide ricci

On Mon, 2022-09-19 at 16:20 +0200, Carlo Piana wrote:
> thank you for a well-detailed and sensible answer. I certainly cannot
> speak on technical issues, although I can understand there are
> activities which could seriously impact the overall process and need
> to be minimized.
> 
> 
> > On Fri, 2022-09-16 at 17:18 +0200, Alberto Pianon wrote:
> > > On 2022-09-15 14:16, Richard Purdie wrote:
> > > > 
> > > > For the source issues above, it basically comes down to how much
> > > > "pain" we want to push onto all users for the sake of adding in this
> > > > data. Unfortunately it is data which many won't need or use and
> > > > different legal departments do have different requirements.
> > > 
> > > We didn't paint the overall picture sufficiently well; therefore our
> > > requirements may come across as coming from a particularly pedantic
> > > legal department; my fault :)
> > > 
> > > Oniro is not "yet another commercial Yocto project", nor are we a legal
> > > department (even if we are experienced FLOSS lawyers and auditors, the
> > > most prominent of whom is Carlo Piana -- cc'ed -- former general counsel
> > > of FSFE and member of OSI Board).
> > > 
> > > Our rather ambitious goal is not limited to Oniro, and consists in doing
> > > compliance in the open source way, both setting an example and
> > > providing guidance and material for others to benefit from our effort.
> > > Our work will therefore be shared (and possibly improved by others) not
> > > only with Oniro-based projects but also with any Yocto project. Among
> > > other things, the most relevant bit of work that we want to share is
> > > **fully reviewed license information** and other legal metadata about a
> > > whole bunch of open source components commonly used in Yocto projects.
> > 
> > I certainly love the goal. I presume you're going to share your review
> > criteria somehow? There must be some further set of steps,
> > documentation and results beyond what we're discussing here?
> 
> Our mandate (and our own attitude) is precisely to make everything as
> public as possible.
> 
> We have published already about it
> https://gitlab.eclipse.org/eclipse/oniro-compliancetoolchain/toolchain/docs/-/tree/main/audit_workflow
> 
> The entire review process is made using GitLab's issues and will be
> made public.

I need to read into the details but that looks like a great start and
I'm happy to see the process being documented!

Thanks for the link, I'll try and have a read.

> We have only one reservation concerning sensitive material,
> in case we find something legally problematic (to comply with
> attorney/client privilege) or security-critical (in which case we
> adopt a responsible disclosure principle and embargo some details).

That makes sense, it is a tricky balancing act at times.

> > I think the challenge will be whether you can publish that review with
> > sufficient "proof" that other legal departments can leverage it. I
> > wouldn't underestimate how different the requirements and process can
> > be between different people/teams/companies.
> 
> Speaking from a legal perspective, this is precisely the point. It is
> true that we want to create a curated database of decisions which, like
> any human enterprise, is prone to errors and corrections, and therefore
> we cannot have the last word. However, IF we can at least point to a
> unique artifact and give its exact hash, there will be no need to
> trust us: everything would be open to inspection, because everybody
> else could look at the same source we have identified and make sure we
> have extracted all the information.

I do love the idea and I think it is quite possible. I do think this
does lead to one of the key details we need to think about though.

From a legal perspective I'd imagine you like dealing with a set of
files that make up the source of some piece of software. I'm not going
to use the word "package" since I think the term is overloaded and
confusing. That set of files can all be identified by checksums. This
pushes us towards wanting checksums of every file.

Stepping over to the build world, we have bitbake's fetcher and it
actually requires something similar - any given "input" must be
uniquely identifiable from the SRC_URI and possibly a set of SRCREVs.

Why? We firstly need to embed this information into the task signature.
If it changes, we know we need to rerun the fetch and re-obtain the
data. We work on inputs to generate this hash, not outputs, and we
require all fetcher modules to be able to identify sources like this.

In the case of a git repo, the hash of a git commit is good enough. For
a tarball, it would be a checksum of the tarball. Where there are local
patch files, we include the hashes of those files.
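
Very loosely, and glossing over what bitbake's signature code actually
does, the idea is just:

    # Illustrative only -- not bitbake's real signature generation.
    import hashlib

    def fetch_input_hash(src_uris, srcrevs, local_file_sums):
        h = hashlib.sha256()
        for uri in sorted(src_uris):
            h.update(uri.encode())
        for name, rev in sorted(srcrevs.items()):
            h.update(("%s=%s" % (name, rev)).encode())
        for path, csum in sorted(local_file_sums.items()):
            h.update(("%s=%s" % (path, csum)).encode())
        return h.hexdigest()

One hash over the inputs identifies the fetched source without ever
checksumming the expanded tree.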

The bottom line is that we already have a hash which represents the
task inputs. Bugs happen, sure. There are also poor fetchers; npm and
go present challenges in particular, but we've tried to work around
those issues.

What you're saying is that you don't trust what bitbake does, so you
want the next level of information, about the individual files.

In theory we could put the SRC_URI and SRCREVs into the SPDX as the
source (which could be summarised into a task hash) rather than the
upstream URL. It all depends which level you want to break things down
to.

I do see a case for needing the lower-level info, as in review you are
going to want to know the delta against the last review decisions. You
would also prefer a different "upstream" URL form for some kinds of
checks, like CVEs. It does feel a lot like we're trying to duplicate
information and cause significant growth of the SPDX files without an
actual definitive need.

You could equally put in a mapping between a fetch task checksum and
the checksums of all the files that fetch task would expand to if run
(it should always do it deterministically).
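
That mapping could be one small record per fetch task, e.g. (format
invented purely for illustration):

    {
      "fetch_task_hash": "<task signature>",
      "expands_to": {
        "src/main.c": "<sha256>",
        "src/util.c": "<sha256>"
      }
    }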

> To be clearer, we are not discussing here the obligation to provide
> the entire corresponding source code as with *GPLv3, but rather we
> are seeking to establish the *provenance* of the software, of all
> bits (also in order to see what patch has been applied by who and to
> close which vulnerability, in case).

My worry is that by not considering the obligation, we don't cater for
a portion of the userbase and, by doing so, we limit the possible
adoption.

> Provenance also has a great impact on "reproducibility" of legal work
> on sources. If we are not able to tell what has gone into our package
> from where (and this may prove hard and require a lot of manual - and
therefore error-prone - work especially in the case of complex Yocto
recipes using e.g. crate/cargo or npm(sw) fetchers), we (lawyers and
> compliance specialists) are at a great disadvantage proving we have
> covered all our bases.

I understand this more than you realise as we have the same problem in
the bitbake fetcher and have spent a lot of time trying to solve it. I
won't claim we're there for some of the modern runtimes, and I'd love
help both in explaining to the upstream projects why we need this and
in technically fixing the fetchers so these modern runtimes work
better.

> This is a very good point, and I can vouch that this is really
> important, but maybe you are reading too much into this: at this stage,
> our goal is not to convince anyone to radically change Yocto tasks to
> meet our requirements, but it is to share such requirements and their
> rationale, collect your feedback and possibly adjust them, and also
> to figure out the least impactful solution to meet them (possibly
> without radical changes but just by adding optional functions in
> existing tasks).

"optional functions" fill me with dread, this is the archiver problem I
mentioned.

One of the things I try really hard to do is to have one good way of
doing things rather than multiple options with different levels of
functionality. If you give people choices, they use them. When
someone's build fails, I don't want to have to ask "which fetcher were
you using? Did you configure X or Y or Z?". If we can all use the same
code and codepaths, it means we see bugs, we see regressions and we
have a common experience without the need for complex test matrices.

Worst case you can add optional functions but I kind of see that as a
failure. If we can find something with low overhead which we can all
use, that would be much better. Whether it is possible, I don't know,
but it is why we're having the discussion. This is why I have a
preference for trying to keep common code paths for the core though.

> > > - I understand that my solution is a bit hacky; but IMHO any other
> > >    *post-mortem* solution would be far more hacky; the real solution
> > >    would be collecting required information directly in do_fetch and
> > >    do_unpack
> > 
> > Agreed, this needs to be done at unpack/patch time. Don't underestimate
> > the impact of this on general users though as many won't appreciate
> > slowing down their builds generating this information :/.
> 
> Can't this be made optional, so one could just go for the "old" way
> without much impact? Sorry, I'm stepping in where I'm naive.

See above :).

> 
> > 
> > There is also a pile of information some legal departments want which
> > you've not mentioned here, such as build scripts and configuration
> > information. Some previous discussions with other parts of the wider
> > open source community rejected the Yocto Project's efforts as insufficient
> > since we didn't mandate and capture all of this too (the archiver could
> > optionally do some of it iirc). Is this just the first step and we're
> > going to continue dumping more data? Or is this sufficient and all any
> > legal department should need?
> > 
> 
> I think that trying to give all legal departments what they want
> would prove impossible. I think the idea here is more to start
> building a collectively managed database of provenance and licensing
> data, with a curated set of decisions for as many packages as
> possible. This way everybody can have some good clue -- and
> increasingly a better one -- as to which license(s) apply to which
> package, removing much of the guesswork that is required today.

It makes sense and is a worthy goal. I just wish we could key this off
bitbake's fetch task checksum rather than having to dump reams of file
checksums!

> We ourselves reuse a lot of information coming from Debian's machine-
> readable copyright files, sometimes finding mistakes and opening issues
> upstream. That has helped us cut down the license information
> harvesting and review work by a great deal.

This does explain why the bitbake fetch mechanism would be a struggle
for you though as you don't want to use our fetch units as your base
component (which is why we end up struggling with some of the issues).

In the interests of moving towards a conclusion, I think what we'll end
up needing to do is generate more information from the fetch and patch
tasks, perhaps with a json file summary of what they do (filenames and
checksums?). That would give your tools data to work from, even if I'm
not convinced we should be dumping more and more data into the final
SPDX files.
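
Something along these lines, perhaps (format invented purely for
illustration):

    {
      "task": "do_unpack",
      "recipe": "example-recipe-1.0",
      "inputs": [
        {"src_uri": "https://example.org/example-1.0.tar.gz",
         "sha256": "<tarball checksum>"}
      ],
      "files": [
        {"path": "example-1.0/src/main.c",
         "sha256": "<file checksum>"}
      ]
    }

That would keep the heavy per-file data out of the SPDX documents
themselves while still letting external tools join everything up by
checksum.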

Cheers,

Richard




^ permalink raw reply	[flat|nested] 11+ messages in thread


Thread overview: 11+ messages
2022-09-14 14:16 Adding more information to the SBOM Marta Rybczynska
2022-09-14 14:56 ` Joshua Watt
2022-09-14 17:10   ` [OE-core] " Alberto Pianon
2022-09-14 20:52     ` Joshua Watt
2022-09-15  1:16   ` [Openembedded-architecture] " Mark Hatle
2022-09-15 12:16 ` Richard Purdie
2022-09-16 15:18   ` Alberto Pianon
2022-09-16 15:49     ` Mark Hatle
2022-09-20 12:25       ` Alberto Pianon
2022-09-16 16:08     ` Richard Purdie
     [not found]       ` <1061592967.5114533.1663597215958.JavaMail.zimbra@piana.eu>
2022-09-20 13:15         ` Richard Purdie
