* Adding more information to the SBOM
From: Marta Rybczynska @ 2022-09-14 14:16 UTC (permalink / raw)
To: OE-core, openembedded-architecture, Joshua Watt

Dear all,
(cross-posting to oe-core and *-architecture)

In the last months, we have worked in Oniro on using the create-spdx
class for both IP compliance and security.

During this work, Alberto Pianon has found that some information is
missing from the SBOM and it does not contain enough for Software
Composition Analysis. The main missing point is the relation between
the actual upstream sources and the final binaries (create-spdx uses
composite sources).

Alberto has worked on how to obtain the missing data and now has a
POC. This POC provides full source-to-binary tracking of Yocto builds
through a couple of scripts (intended to be transformed into a new
bbclass at a later stage). The goal is to add the missing pieces of
information in order to get a "real" SBOM from Yocto, which should, at
a minimum:

- carefully describe what is found in a final image (i.e. binary files
  and their dependencies), since that is what is actually distributed
  and goes into the final product;
- describe how such binary files have been generated and where they
  come from (i.e. upstream sources, including patches and other stuff
  added from meta-layers); provenance is important for a number of
  reasons related to IP compliance and security.
The aim is to become able to:

- map binaries to their corresponding upstream source packages (and
  not to the "internal" source packages created by recipes by combining
  multiple upstream sources and patches);
- map binaries to the source files that have actually been used to
  build them - which usually are a small subset of the whole source
  package.

With respect to IP compliance, this would, among other things, allow us to:

- get the real license text for each binary file, by getting the
  license of the specific source files it has been generated from
  (provided by Fossology, for instance) - and not the main license
  stated in the corresponding recipe (which may be as confusing as
  GPL-2.0-or-later & LGPL-2.1-or-later & BSD-3-Clause & BSD-4-Clause,
  or even worse);
- automatically check license incompatibilities at the binary file level.

Other interesting things could also be done on the security side.

This work intends to add a way to provide additional data that can be
used by create-spdx, not to replace create-spdx in any way.

The sources, with a long README, are available at
https://gitlab.eclipse.org/eclipse/oniro-compliancetoolchain/toolchain/tinfoilhat/-/tree/srctracker/srctracker

What do you think of this work? Would it be of interest to integrate
into YP at some point? Shall we discuss this?

Marta and Alberto
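[Editorial note: as an illustration of the last two bullets, here is a
minimal Python sketch of what binary-level license aggregation and
incompatibility checking could look like once file-level license data is
available. All file names, licenses, link maps and the incompatibility
table are invented for the example; this is not the POC's actual code.]

```python
# Hypothetical scan results: source file -> concluded SPDX license ID
file_licenses = {
    "lib/strutils.c": "LGPL-2.1-or-later",
    "lib/md5.c": "BSD-3-Clause",
    "login-utils/su.c": "GPL-2.0-or-later",
    "old/old.c": "GPL-2.0-only",
    "net/net.c": "Apache-2.0",
}

# Hypothetical link map: binary -> source files actually compiled into it
binary_sources = {
    "su": ["login-utils/su.c", "lib/strutils.c"],
    "md5sum": ["lib/md5.c"],
    "netd": ["old/old.c", "net/net.c"],
}

# Illustrative-only subset: unordered license pairs considered
# incompatible when linked together (a real check needs a vetted matrix)
INCOMPATIBLE = {
    frozenset({"GPL-2.0-only", "Apache-2.0"}),
}

def binary_license_report(binary):
    """Collect the distinct concluded licenses behind one binary."""
    return sorted({file_licenses[f] for f in binary_sources[binary]})

def check_binary(binary):
    """Return incompatible license pairs found among a binary's sources."""
    lics = binary_license_report(binary)
    return [
        (a, b)
        for i, a in enumerate(lics)
        for b in lics[i + 1:]
        if frozenset({a, b}) in INCOMPATIBLE
    ]

# Per-binary licenses, rather than the recipe-wide four-license conjunction:
print(binary_license_report("su"))  # ['GPL-2.0-or-later', 'LGPL-2.1-or-later']
print(check_binary("netd"))         # [('Apache-2.0', 'GPL-2.0-only')]
```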
* Re: Adding more information to the SBOM
From: Joshua Watt @ 2022-09-14 14:56 UTC (permalink / raw)
To: Marta Rybczynska; +Cc: OE-core, openembedded-architecture

On Wed, Sep 14, 2022 at 9:16 AM Marta Rybczynska <rybczynska@gmail.com> wrote:
>
> Dear all,
> (cross-posting to oe-core and *-architecture)
> In the last months, we have worked in Oniro on using the create-spdx
> class for both IP compliance and security.
>
> During this work, Alberto Pianon has found that some information is
> missing from the SBOM and it does not contain enough for Software
> Composition Analysis. The main missing point is the relation between
> the actual upstream sources and the final binaries (create-spdx uses
> composite sources).

I believe we map the binaries to the source code from the -dbg
packages; is the premise that this is insufficient? Can you elaborate
more on why that is? I don't quite understand. The debug sources are
(basically) what we actually compiled (e.g. post-do_patch) to produce
the binary, and you can in turn follow these back to the upstream
sources with the downloadLocation property.

> Alberto has worked on how to obtain the missing data and now has a
> POC. This POC provides full source-to-binary tracking of Yocto builds
> through a couple of scripts (intended to be transformed into a new
> bbclass at a later stage).
The goal is to add the missing pieces of > information in order to get a "real" SBOM from Yocto, which should, at > a minimum: Please be a little careful with the wording; SBoMs have a lot of uses, and many of them we can satisfy with what we currently generate; it may not do the exact use case you are looking for, but that doesn't mean it's not a "real" SBoM :) > > - carefully describe what is found in a final image (i.e. binary files > and their dependencies), since that is what is actually distributed > and goes into the final product; > - describe how such binary files have been generated and where they > come from (i.e. upstream sources, including patches and other stuff > added from meta-layers); provenance is important for a number of > reasons related to IP Compliance and security. > > The aim is to become able to: > > - map binaries to their corresponding upstream source packages (and > not to the "internal" source packages created by recipes by combining > multiple upstream sources and patches) > - map binaries to the source files that have been actually used to > build them - which usually are a small subset of the whole source > package > > With respect to IP compliance, this would allow to, among other things: > > - get the real license text for each binary file, by getting the > license of the specific source files it has been generated from > (provided by Fossology, for instance), - and not the main license > stated in the corresponding recipe (which may be as confusing as > GPL-2.0-or-later & LGPL-2.1-or-later & BSD-3-Clause & BSD-4-Clause, or > even worse) IIUC this is the difference between the "Declared" license and the "Concluded" license. You can report both, and I think create-spdx.bbclass can currently do this with its rudimentary source license scanning. You really do want both and it's a great way to make sure that the "Declared" license (that is the license in the recipe) reflects the reality of the source code. 
> - automatically check license incompatibilities at the binary file level.
>
> Other possible interesting things could be done also on the security side.
>
> This work intends to add a way to provide additional data that can be
> used by create-spdx, not to replace create-spdx in any way.
>
> The sources with a long README are available at
> https://gitlab.eclipse.org/eclipse/oniro-compliancetoolchain/toolchain/tinfoilhat/-/tree/srctracker/srctracker
>
> What do you think of this work? Would it be of interest to integrate
> into YP at some point? Shall we discuss this?

This seems promising as something that could potentially move into
core. I have a few points:

- The extraction of the sources to a dedicated directory is something
  that Richard has been toying around with for quite a while, and I
  think it would greatly simplify that part of your process. I would
  very much encourage you to look at the work he's done, and work on
  getting it pushed across the finish line, as it's a really good
  improvement that would benefit more than just your source scanning.
- I would encourage you not to wait to turn this into a bbclass and/or
  library functions. You should be able to do this in a new layer, and
  that would make it much clearer what the path to being included in
  OE-core would look like. It would also (IMHO) be nicer to the users :)

> Marta and Alberto
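[Editorial note: the "Declared" vs "Concluded" distinction Joshua
describes maps directly onto standard SPDX 2.x file-level fields. An
illustrative file entry might look like the following; the file name,
license values and comment are invented for the example, and a real
entry would carry additional mandatory fields such as checksums.]

```json
{
  "SPDXID": "SPDXRef-File-libblkid-blkid-c",
  "fileName": "./libblkid/src/blkid.c",
  "licenseInfoInFiles": ["LGPL-2.1-or-later"],
  "licenseConcluded": "LGPL-2.1-or-later",
  "licenseComments": "Concluded from file headers; the recipe-level declared license is GPL-2.0-or-later & LGPL-2.1-or-later & BSD-3-Clause & BSD-4-Clause."
}
```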
* Re: [OE-core] Adding more information to the SBOM
From: Alberto Pianon @ 2022-09-14 17:10 UTC (permalink / raw)
To: Joshua Watt; +Cc: Marta Rybczynska, OE-core, openembedded-architecture

Hi Joshua,

nice to meet you! I'm new to this list, and I've always approached
Yocto just from the "IP compliance side", so I may be missing important
pieces of information. That is why Marta encouraged me, and is helping
me, to ask for community feedback.

On 2022-09-14 16:56, Joshua Watt wrote:
> On Wed, Sep 14, 2022 at 9:16 AM Marta Rybczynska <rybczynska@gmail.com>
> wrote:
>>
>> Dear all,
>> (cross-posting to oe-core and *-architecture)
>> In the last months, we have worked in Oniro on using the create-spdx
>> class for both IP compliance and security.
>>
>> During this work, Alberto Pianon has found that some information is
>> missing from the SBOM and it does not contain enough for Software
>> Composition Analysis. The main missing point is the relation between
>> the actual upstream sources and the final binaries (create-spdx uses
>> composite sources).
>
> I believe we map the binaries to the source code from the -dbg
> packages; is the premise that this is insufficient? Can you elaborate
> more on why that is, I don't quite understand. The debug sources are
> (basically) what we actually compiled (e.g. post-do_patch) to produce
> the binary, and you can in turn follow these back to the upstream
> sources with the downloadLocation property.

This was also my assumption at the beginning. But then I found that
there are recipes with multiple upstream sources, which may be
combined/mixed together in recipes' WORKDIR.
For instance this one:

https://git.yoctoproject.org/meta-virtualization/tree/recipes-networking/cni/cni_git.bb

SRC_URI = "\
    git://github.com/containernetworking/cni.git;branch=main;name=cni;protocol=https \
    git://github.com/containernetworking/plugins.git;branch=release-1.1;destsuffix=${S}/src/github.com/containernetworking/plugins;name=plugins;protocol=https \
    git://github.com/flannel-io/cni-plugin;branch=main;name=flannel_plugin;protocol=https;destsuffix=${S}/src/github.com/containernetworking/plugins/plugins/meta/flannel \
"

(The third source is unpacked in a subdir of the second one.)

From here I discovered that we can't assume that the first non-local
URI is the downloadLocation for all source files, because that is not
always the case.

Moreover, in the context of our project we also needed to find the
upstream sources for local patches, scripts, etc. added by recipes
(i.e. the corresponding layers' repos).

>
>>
>> Alberto has worked on how to obtain the missing data and now has a
>> POC. This POC provides full source-to-binary tracking of Yocto builds
>> through a couple of scripts (intended to be transformed into a new
>> bbclass at a later stage). The goal is to add the missing pieces of
>> information in order to get a "real" SBOM from Yocto, which should, at
>> a minimum:
>
> Please be a little careful with the wording; SBoMs have a lot of uses,
> and many of them we can satisfy with what we currently generate; it
> may not do the exact use case you are looking for, but that doesn't
> mean it's not a "real" SBoM :)

You are right, sorry!
"real" is meant in the context of our project, where we need to make our Fossology Audit Team work on "original" upstream source packages/repos, for a number of reasons (the main being that in Oniro project we have a complex build matrix with a lot of available target machines and quite a number of different overrides depending on the machine, so when it comes to IP compliance we need to aggregate and simplify, otherwise our IP auditors would die :) ) But since our Audit Team, differently from a commercial project, is working fully in the open, also other projects may benefit from this approach: having fully reviewed file-level license data publicly available for quite a number of upstream sources and Yocto layers, a complete source-to-binary tracking system would enable any Yocto projects to get very detailed license information for their images, to automatically detect license incompatibilities between linked binary files, etc. > >> >> - carefully describe what is found in a final image (i.e. binary files >> and their dependencies), since that is what is actually distributed >> and goes into the final product; >> - describe how such binary files have been generated and where they >> come from (i.e. upstream sources, including patches and other stuff >> added from meta-layers); provenance is important for a number of >> reasons related to IP Compliance and security. 
>> >> The aim is to become able to: >> >> - map binaries to their corresponding upstream source packages (and >> not to the "internal" source packages created by recipes by combining >> multiple upstream sources and patches) >> - map binaries to the source files that have been actually used to >> build them - which usually are a small subset of the whole source >> package >> >> With respect to IP compliance, this would allow to, among other >> things: >> >> - get the real license text for each binary file, by getting the >> license of the specific source files it has been generated from >> (provided by Fossology, for instance), - and not the main license >> stated in the corresponding recipe (which may be as confusing as >> GPL-2.0-or-later & LGPL-2.1-or-later & BSD-3-Clause & BSD-4-Clause, or >> even worse) > > IIUC this is the difference between the "Declared" license and the > "Concluded" license. You can report both, and I think > create-spdx.bbclass can currently do this with its rudimentary source > license scanning. You really do want both and it's a great way to make > sure that the "Declared" license (that is the license in the recipe) > reflects the reality of the source code. > The issue is with components like util-linux, which contains a lot of sub-components subject to different licenses; util-linux recipe's license is "GPL-2.0-or-later & LGPL-2.1-or-later & BSD-3-Clause & BSD-4-Clause", but from such information one cannot tell if a particular binary file generated from util-linux is subject to GPL, LGPL, or BSD-3|4-clause. Of course, being able to track upstream sources to binaries at file level would be useless if one doesn't have file-level license information; but since Scancode and Fossology (and our Audit Team) may provide such information, such tracking may become super-useful, in our opinion. >> - automatically check license incompatibilities at the binary file >> level. 
>> >> Other possible interesting things could be done also on the security >> side. >> >> This work intends to add a way to provide additional data that can be >> used by create-spdx, not to replace create-spdx in any way. >> >> The sources with a long README are available at >> https://gitlab.eclipse.org/eclipse/oniro-compliancetoolchain/toolchain/tinfoilhat/-/tree/srctracker/srctracker >> >> What do you think of this work? Would it be of interest to integrate >> into YP at some point? Shall we discuss this? > > This seems promising as something that could potentially move into > core. I have a few points: > - The extraction of the sources to a dedicated directory is something > that Richard has been toying around with for quite a while, and I > think it would greatly simplify that part of your process. I would > very much encourage you to look at the work he's done, and work on > that to get it pushed across the finish line as it's a really good > improvement that would benefit not just your source scanning. Thanks for the suggestion, could you point me to Richard's work? I'll surely look into it. > - I would encourage you to not wait to turn this into a bbclass > and/or library functions. You should be able to do this in a new > layer, and that would make it much clearer as to what the path to > being included in OE-core would look like. It also would (IMHO) be > nicer to the users :) Understood :) I'm the newbie here, so any other suggestion is warmly welcome. Regards, Alberto ^ permalink raw reply [flat|nested] 11+ messages in thread
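[Editorial note: the "which SRC_URI entry owns this file" question from
the cni recipe discussed above can be sketched as a longest-prefix
match over the unpack directories: the true downloadLocation of a file
is the entry with the longest matching unpack prefix, not the first
non-local URI. The unpack paths below are illustrative assumptions;
this is not what the POC actually does.]

```python
# Hypothetical (url, unpack_dir) pairs, laid out the way the cni recipe
# nests its three repositories via destsuffix
SOURCES = [
    ("git://github.com/containernetworking/cni.git",
     "git/"),
    ("git://github.com/containernetworking/plugins.git",
     "git/src/github.com/containernetworking/plugins/"),
    ("git://github.com/flannel-io/cni-plugin",
     "git/src/github.com/containernetworking/plugins/plugins/meta/flannel/"),
]

def download_location(path):
    """Map an unpacked file path to the SRC_URI entry that owns it,
    preferring the longest (most deeply nested) unpack prefix."""
    best = None
    for url, prefix in SOURCES:
        if path.startswith(prefix) and (best is None or len(prefix) > len(best[1])):
            best = (url, prefix)
    return best[0] if best else None

# A file in the flannel subtree belongs to the third repo, even though
# its path also sits under the plugins (and cni) unpack directories.
print(download_location(
    "git/src/github.com/containernetworking/plugins/plugins/meta/flannel/flannel.go"))
```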
* Re: [OE-core] Adding more information to the SBOM 2022-09-14 17:10 ` [OE-core] " Alberto Pianon @ 2022-09-14 20:52 ` Joshua Watt 0 siblings, 0 replies; 11+ messages in thread From: Joshua Watt @ 2022-09-14 20:52 UTC (permalink / raw) To: Alberto Pianon; +Cc: Marta Rybczynska, OE-core, openembedded-architecture On Wed, Sep 14, 2022 at 12:10 PM Alberto Pianon <alberto@pianon.eu> wrote: > > Hi Joshua, > > nice to meet you! > > I'm new to this list, and I've always approached Yocto just from the > "IP compliance side", so I may miss important pieces of information. > That > is why Marta encouraged me and is helping me to ask community feedback. > > Il 2022-09-14 16:56 Joshua Watt ha scritto: > > On Wed, Sep 14, 2022 at 9:16 AM Marta Rybczynska <rybczynska@gmail.com> > > wrote: > >> > >> Dear all, > >> (cross-posting to oe-core and *-architecture) > >> In the last months, we have worked in Oniro on using the create-spdx > >> class for both IP compliance and security. > >> > >> During this work, Alberto Pianon has found that some information is > >> missing from the SBOM and it does not contain enough for Software > >> Composition Analysis. The main missing point is the relation between > >> the actual upstream sources and the final binaries (create-spdx uses > >> composite sources). > > > > I believe we map the binaries to the source code from the -dbg > > packages; is the premise that this is insufficient? Can you elaborate > > more on why that is, I don't quite understand. The debug sources are > > (basically) what we actually compiled (e.g. post-do_patch) to produce > > the binary, and you can in turn follow these back to the upstream > > sources with the downloadLocation property. > > This was also my assumption at the beginning. But then I found that > there > are recipes with multiple upstream sources, which may be combined/mixed > together in recipes' WORKDIR. 
> For instance this one:
>
> https://git.yoctoproject.org/meta-virtualization/tree/recipes-networking/cni/cni_git.bb
>
> SRC_URI = "\
>     git://github.com/containernetworking/cni.git;branch=main;name=cni;protocol=https \
>     git://github.com/containernetworking/plugins.git;branch=release-1.1;destsuffix=${S}/src/github.com/containernetworking/plugins;name=plugins;protocol=https \
>     git://github.com/flannel-io/cni-plugin;branch=main;name=flannel_plugin;protocol=https;destsuffix=${S}/src/github.com/containernetworking/plugins/plugins/meta/flannel \
> "
>
> (The third source is unpacked in a subdir of the second one)
>
> From here I discovered that we can't assume that the first non-local URI
> is the downloadLocation for all source files, because it is not always
> the case.

This is true, but I think that's more of a problem with the inability
to express multiple download locations in the SPDX, not that we don't
have all the source when we generate the SPDX, correct? I _believe_ the
-dbg package still contains all the source code from all three URLs?

> Moreover, in the context of our project we also needed to find the
> upstream sources for local patches, scripts, etc. added by recipes
> (i.e. the corresponding layers' repos).

Ok, so this makes me wonder: if we implement the better source
extraction in OE-core, does that help this problem? Is the primary
problem that you want the unpatched upstream source code files instead
of the patched ones, or is it some other problem? AFAIK, the -dbg
package contains the source code we actually compiled... so I have a
hard time understanding what's "incorrect" (or not ideal) about
referencing it; but I think I'm missing something important :)

>
>
> >
> >> Alberto has worked on how to obtain the missing data and now has a
> >> POC. This POC provides full source-to-binary tracking of Yocto builds
> >> through a couple of scripts (intended to be transformed into a new
> >> bbclass at a later stage).
> >> The goal is to add the missing pieces of
> >> information in order to get a "real" SBOM from Yocto, which should,
> >> at a minimum:
> >
> > Please be a little careful with the wording; SBoMs have a lot of
> > uses, and many of them we can satisfy with what we currently
> > generate; it may not do the exact use case you are looking for, but
> > that doesn't mean it's not a "real" SBoM :)
>
> You are right, sorry! "real" is meant in the context of our project,
> where we need to make our Fossology Audit Team work on "original"
> upstream source packages/repos, for a number of reasons (the main being
> that in Oniro project we have a complex build matrix with a lot of
> available target machines and quite a number of different overrides
> depending on the machine, so when it comes to IP compliance we need to
> aggregate and simplify, otherwise our IP auditors would die :) )
>
> But since our Audit Team, differently from a commercial project,
> is working fully in the open, also other projects may benefit
> from this approach: having fully reviewed file-level license
> data publicly available for quite a number of upstream sources and
> Yocto layers, a complete source-to-binary tracking system would
> enable any Yocto projects to get very detailed license information
> for their images, to automatically detect license incompatibilities
> between linked binary files, etc.

Ok, so let me see if I can follow what you want here:

1) Your Audit Team scans some open source repository, and generates
   some sort of license report for it
2) You do a Yocto build that builds that repository
3) You want to link the SBoM generated by Yocto back to the report from
   the Audit Team; specifically, you want to be able to trace binaries
   in the system back to the original source code from the Audit Team
   report?

Currently #3 is difficult because:

1) Yocto only reports one SRC_URI in the SBoM
2) Binaries are tracked back to the patched source code (in the -dbg
   packages), so the checksums may not match the original upstream
   source code

Any other reasons?

>
> >
> >>
> >> - carefully describe what is found in a final image (i.e. binary files
> >> and their dependencies), since that is what is actually distributed
> >> and goes into the final product;
> >> - describe how such binary files have been generated and where they
> >> come from (i.e. upstream sources, including patches and other stuff
> >> added from meta-layers); provenance is important for a number of
> >> reasons related to IP Compliance and security.
> >>
> >> The aim is to become able to:
> >>
> >> - map binaries to their corresponding upstream source packages (and
> >> not to the "internal" source packages created by recipes by combining
> >> multiple upstream sources and patches)
> >> - map binaries to the source files that have been actually used to
> >> build them - which usually are a small subset of the whole source
> >> package
> >>
> >> With respect to IP compliance, this would allow to, among other
> >> things:
> >>
> >> - get the real license text for each binary file, by getting the
> >> license of the specific source files it has been generated from
> >> (provided by Fossology, for instance), - and not the main license
> >> stated in the corresponding recipe (which may be as confusing as
> >> GPL-2.0-or-later & LGPL-2.1-or-later & BSD-3-Clause & BSD-4-Clause, or
> >> even worse)
> >
> > IIUC this is the difference between the "Declared" license and the
> > "Concluded" license. You can report both, and I think
> > create-spdx.bbclass can currently do this with its rudimentary source
> > license scanning. You really do want both and it's a great way to make
> > sure that the "Declared" license (that is the license in the recipe)
> > reflects the reality of the source code.
> > > > The issue is with components like util-linux, which contains a lot of > sub-components subject to different licenses; util-linux recipe's > license is "GPL-2.0-or-later & LGPL-2.1-or-later & BSD-3-Clause & > BSD-4-Clause", but from such information one cannot tell if a particular > binary file generated from util-linux is subject to GPL, LGPL, or > BSD-3|4-clause. > > Of course, being able to track upstream sources to binaries at file > level would be useless if one doesn't have file-level license > information; > but since Scancode and Fossology (and our Audit Team) may provide such > information, such tracking may become super-useful, in our opinion. We also implement (and report) some rudimentary license scanning in Yocto, but we only look for "SPDX-License-Identifier" tags > > > >> - automatically check license incompatibilities at the binary file > >> level. > >> > >> Other possible interesting things could be done also on the security > >> side. > >> > >> This work intends to add a way to provide additional data that can be > >> used by create-spdx, not to replace create-spdx in any way. > >> > >> The sources with a long README are available at > >> https://gitlab.eclipse.org/eclipse/oniro-compliancetoolchain/toolchain/tinfoilhat/-/tree/srctracker/srctracker > >> > >> What do you think of this work? Would it be of interest to integrate > >> into YP at some point? Shall we discuss this? > > > > This seems promising as something that could potentially move into > > core. I have a few points: > > - The extraction of the sources to a dedicated directory is something > > that Richard has been toying around with for quite a while, and I > > think it would greatly simplify that part of your process. I would > > very much encourage you to look at the work he's done, and work on > > that to get it pushed across the finish line as it's a really good > > improvement that would benefit not just your source scanning. 
>
> Thanks for the suggestion, could you point me to Richard's work?
> I'll surely look into it.
>
> > - I would encourage you to not wait to turn this into a bbclass
> > and/or library functions. You should be able to do this in a new
> > layer, and that would make it much clearer as to what the path to
> > being included in OE-core would look like. It also would (IMHO) be
> > nicer to the users :)
>
> Understood :)
>
> I'm the newbie here, so any other suggestion is warmly welcome.
>
> Regards,
>
> Alberto
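[Editorial note: for context on the "rudimentary license scanning"
Joshua mentions earlier in this message, a minimal tag scan of that
kind might look like the sketch below. The regex and behaviour are an
assumption for illustration, not create-spdx.bbclass's actual code.]

```python
import re

# Match an SPDX-License-Identifier tag followed by a (possibly compound)
# license expression such as "GPL-2.0-only OR BSD-3-Clause"
TAG_RE = re.compile(
    r"SPDX-License-Identifier:\s*"
    r"([^\s*/]+(?:\s+(?:AND|OR|WITH)\s+[^\s*/]+)*)"
)

def scan_text(text):
    """Return the SPDX license expressions found in tag comments."""
    return [m.group(1).strip() for m in TAG_RE.finditer(text)]

src = """\
// SPDX-License-Identifier: GPL-2.0-only OR BSD-3-Clause
int main(void) { return 0; }
"""
print(scan_text(src))  # ['GPL-2.0-only OR BSD-3-Clause']
```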
* Re: [Openembedded-architecture] Adding more information to the SBOM
From: Mark Hatle @ 2022-09-15 1:16 UTC (permalink / raw)
To: Joshua Watt, Marta Rybczynska; +Cc: OE-core, openembedded-architecture

On 9/14/22 9:56 AM, Joshua Watt wrote:
> On Wed, Sep 14, 2022 at 9:16 AM Marta Rybczynska <rybczynska@gmail.com> wrote:
>>
>> Dear all,
>> (cross-posting to oe-core and *-architecture)
>> In the last months, we have worked in Oniro on using the create-spdx
>> class for both IP compliance and security.
>>
>> During this work, Alberto Pianon has found that some information is
>> missing from the SBOM and it does not contain enough for Software
>> Composition Analysis. The main missing point is the relation between
>> the actual upstream sources and the final binaries (create-spdx uses
>> composite sources).
>
> I believe we map the binaries to the source code from the -dbg
> packages; is the premise that this is insufficient? Can you elaborate
> more on why that is, I don't quite understand. The debug sources are
> (basically) what we actually compiled (e.g. post-do_patch) to produce
> the binary, and you can in turn follow these back to the upstream
> sources with the downloadLocation property.

When I last looked at this, it was critical that the analysis be:
binary -> patched & configured source (dbg package) -> how the sources
were constructed, as Joshua said above.

I believe all of the information is present for this, as you can tie
the binary (through debug symbols) back to the debug package, and the
source of the debug package back to the sources that constructed it via
heuristics. (If you enable the git patch mechanism, it should even be
possible to use git blame to find exactly which upstreams constructed
the patched sources.)
For generated content it's more difficult -- but for those items there
is usually a header which indicates what generated the content, so
other heuristics can be used.

>>
>> Alberto has worked on how to obtain the missing data and now has a
>> POC. This POC provides full source-to-binary tracking of Yocto builds
>> through a couple of scripts (intended to be transformed into a new
>> bbclass at a later stage). The goal is to add the missing pieces of
>> information in order to get a "real" SBOM from Yocto, which should, at
>> a minimum:
>
> Please be a little careful with the wording; SBoMs have a lot of uses,
> and many of them we can satisfy with what we currently generate; it
> may not do the exact use case you are looking for, but that doesn't
> mean it's not a "real" SBoM :)
>
>>
>> - carefully describe what is found in a final image (i.e. binary files
>> and their dependencies), since that is what is actually distributed
>> and goes into the final product;
>> - describe how such binary files have been generated and where they
>> come from (i.e. upstream sources, including patches and other stuff
>> added from meta-layers); provenance is important for a number of
>> reasons related to IP Compliance and security.

Full compliance will require binaries mapped to patched source, to
upstream sources _AND_ to the instructions (layer/recipe/configuration)
used to build them. But it's up to the local legal determination to
figure out "how far you really need to go", vs. just "here are the
layers I used to build my project".
>> The aim is to become able to:
>>
>> - map binaries to their corresponding upstream source packages (and
>> not to the "internal" source packages created by recipes by combining
>> multiple upstream sources and patches)
>> - map binaries to the source files that have been actually used to
>> build them - which usually are a small subset of the whole source
>> package
>>
>> With respect to IP compliance, this would allow to, among other things:
>>
>> - get the real license text for each binary file, by getting the
>> license of the specific source files it has been generated from
>> (provided by Fossology, for instance), - and not the main license
>> stated in the corresponding recipe (which may be as confusing as
>> GPL-2.0-or-later & LGPL-2.1-or-later & BSD-3-Clause & BSD-4-Clause, or
>> even worse)
>
> IIUC this is the difference between the "Declared" license and the
> "Concluded" license. You can report both, and I think
> create-spdx.bbclass can currently do this with its rudimentary source
> license scanning. You really do want both and it's a great way to make
> sure that the "Declared" license (that is the license in the recipe)
> reflects the reality of the source code.

And the thing to keep in mind is that in a given package the "Declared"
license is usually what a LICENSE file or header says, while the
"Concluded" license has levels of quality behind it. The first level of
quality is "Declared". The next level is automation (something like
Fossology), the next level is human-reviewed, and the highest level is
"lawyer-reviewed".

So being able to inject SPDX information with Concluded values for
evaluation, and to track the "quality level", has always been something
I wanted to do but never had time for. At the time, my idea was a
database (and/or bbappend) for each component that would include
pre-processed SPDX data for each recipe. This data would run through a
validation step to show that it actually matches the patched sources.
(If any file checksums do NOT match, then they would be flagged for
follow-up.)

>> - automatically check license incompatibilities at the binary file level.
>>
>> Other possible interesting things could be done also on the security side.
>>
>> This work intends to add a way to provide additional data that can be
>> used by create-spdx, not to replace create-spdx in any way.
>>
>> The sources with a long README are available at
>> https://gitlab.eclipse.org/eclipse/oniro-compliancetoolchain/toolchain/tinfoilhat/-/tree/srctracker/srctracker
>>
>> What do you think of this work? Would it be of interest to integrate
>> into YP at some point? Shall we discuss this?
>
> This seems promising as something that could potentially move into
> core. I have a few points:
> - The extraction of the sources to a dedicated directory is something
> that Richard has been toying around with for quite a while, and I
> think it would greatly simplify that part of your process. I would
> very much encourage you to look at the work he's done, and work on
> that to get it pushed across the finish line as it's a really good
> improvement that would benefit not just your source scanning.
> - I would encourage you to not wait to turn this into a bbclass
> and/or library functions. You should be able to do this in a new
> layer, and that would make it much clearer as to what the path to
> being included in OE-core would look like. It also would (IMHO) be
> nicer to the users :)

Agreed, this looks useful. The key is to start turning it into one or
more bbclasses now -- things that work with the Yocto Project process.
Don't try to "post-process" and reconstruct sources. Instead, inject
steps that will run your file checksums and build up your database as
the sources are constructed (i.e. in do_unpack, do_patch, etc.).

The key is: all of the information IS available. It just may not be in
the format you want.
--Mark

>> Marta and Alberto
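The "quality level" ladder Mark describes above could be modeled along these lines (an illustrative sketch only; these names are hypothetical and not an existing OE-core API):

```python
from dataclasses import dataclass
from enum import IntEnum

class Quality(IntEnum):
    """Confidence ladder for a Concluded license, lowest to highest."""
    DECLARED = 1         # taken verbatim from the recipe / LICENSE file
    AUTOMATED = 2        # produced by a scanner such as Fossology
    HUMAN_REVIEWED = 3   # checked by an auditor
    LAWYER_REVIEWED = 4  # signed off by counsel

@dataclass
class LicenseConclusion:
    file_sha256: str  # checksum of the source file the review was done against
    declared: str     # SPDX expression from the recipe
    concluded: str    # SPDX expression after review
    quality: Quality

def needs_followup(entry: LicenseConclusion, current_sha256: str) -> bool:
    """The validation step Mark mentions: a conclusion is stale if the
    patched source no longer matches the checksum it was reviewed against."""
    return entry.file_sha256 != current_sha256

entry = LicenseConclusion("abc123", "GPL-2.0-or-later", "GPL-2.0-only",
                          Quality.AUTOMATED)
print(needs_followup(entry, "abc123"))  # False: checksum still matches
```

The IntEnum ordering lets a tool pick the highest-quality conclusion available for a file.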
* Re: [Openembedded-architecture] Adding more information to the SBOM
  2022-09-14 14:16 Adding more information to the SBOM Marta Rybczynska
  2022-09-14 14:56 ` Joshua Watt
@ 2022-09-15 12:16 ` Richard Purdie
  2022-09-16 15:18 ` Alberto Pianon
  1 sibling, 1 reply; 11+ messages in thread
From: Richard Purdie @ 2022-09-15 12:16 UTC (permalink / raw)
To: Marta Rybczynska, OE-core, openembedded-architecture, Joshua Watt

On Wed, 2022-09-14 at 16:16 +0200, Marta Rybczynska wrote:
> The sources with a long README are available at
> https://gitlab.eclipse.org/eclipse/oniro-compliancetoolchain/toolchain/tinfoilhat/-/tree/srctracker/srctracker
>
> What do you think of this work? Would it be of interest to integrate
> into YP at some point? Shall we discuss this?

I had a look at this and was a bit puzzled by some of it.

I can see the issues you'd have if you want to separate the unpatched source from the patches and know which files had patches applied, as that is hard to track. There would be significant overhead in trying to process and store that information in the unpack/patch steps, and the archiver class does some of that already. It is messy, hard and doesn't perform well. I'm reluctant to force everyone to do it as a result, but that can also result in multiple code paths, and when you have that, the result is that one breaks :(.

I also can see the issue with multiple sources in SRC_URI, although you should be able to map those back if you assume subtrees are "owned" by given SRC_URI entries. I suspect there may be a SPDX format limit in documenting that piece?

Where I became puzzled is where you say "Information about debug sources for each actual binary file is then taken from tmp/pkgdata/<machine>/extended/*.json.zstd". This is the data we added and use for the spdx class, so you shouldn't need to reinvent that piece. It should be the exact same data the spdx class uses.

I was also puzzled about the difference between rpm and the other package backends.
The exact same files are packaged by all the package backends, so the checksums from do_package should be fine.

For the source issues above, it basically comes down to how much "pain" we want to push onto all users for the sake of adding in this data. Unfortunately it is data which many won't need or use, and different legal departments do have different requirements. Experience with archiver.bbclass shows that multiple codepaths doing these things are a nightmare to keep working, particularly for corner cases which do interesting things with the code (externalsrc, gcc shared workdir, the kernel and more).

Cheers,

Richard
* Re: [Openembedded-architecture] Adding more information to the SBOM
  2022-09-15 12:16 ` Richard Purdie
@ 2022-09-16 15:18 ` Alberto Pianon
  2022-09-16 15:49 ` Mark Hatle
  2022-09-16 16:08 ` Richard Purdie
  0 siblings, 2 replies; 11+ messages in thread
From: Alberto Pianon @ 2022-09-16 15:18 UTC (permalink / raw)
To: Richard Purdie
Cc: Marta Rybczynska, OE-core, openembedded-architecture, Joshua Watt, 'Carlo Piana', davide.ricci

Hi Richard, thank you for your reply; you gave me very interesting cues to think about. I'll reply in reverse/importance order.

Il 2022-09-15 14:16 Richard Purdie wrote:
>
> For the source issues above it basically it comes down to how much
> "pain" we want to push onto all users for the sake of adding in this
> data. Unfortunately it is data which many won't need or use and
> different legal departments do have different requirements.

We didn't paint the overall picture sufficiently well, therefore our requirements may come across as coming from a particularly pedantic legal department; my fault :)

Oniro is not "yet another commercial Yocto project", and we are not a legal department (even if we are experienced FLOSS lawyers and auditors, the most prominent of whom is Carlo Piana -- cc'ed -- former general counsel of FSFE and member of the OSI Board).

Our rather ambitious goal is not limited to Oniro: it consists in doing compliance in the open source way, both setting an example and providing guidance and material for others to benefit from our effort. Our work will therefore be shared (and possibly improved by others) not only with Oniro-based projects but also with any Yocto project. Among other things, the most relevant bit of work that we want to share is **fully reviewed license information** and other legal metadata about a whole bunch of open source components commonly used in Yocto projects.
To do that in a **scalable and fully automated way**, we need Yocto to collect some information that is currently disposed of (or simply not collected) at build time.

Oniro Project Leader Davide Ricci -- cc'ed -- strongly encouraged us to seek feedback from you in order to find out the best way to do it.

Maybe organizing a call would be more convenient than discussing background and requirements here, if you (and others) are available.

> Experience
> with archiver.bbclass shows that multiple codepaths doing these things
> is a nightmare to keep working, particularly for corner cases which do
> interesting things with the code (externalsrc, gcc shared workdir, the
> kernel and more).
>
> I had a look at this and was a bit puzzled by some of it.
>
> I can see the issues you'd have if you want to separate the unpatched
> source from the patches and know which files had patches applied as
> that is hard to track. There would be significiant overhead in trying
> to process and store that information in the unpack/patch steps and the
> archiver class does some of that already. It is messy, hard and doens't
> perform well. I'm reluctant to force everyone to do it as a result but
> that can also result in multiple code paths and when you have that, the
> result is that one breaks :(.
>
> I also can see the issue with multiple sources in SRC_URI, although you
> should be able to map those back if you assume subtrees are "owned" by
> given SRC_URI entries. I suspect there may be a SPDX format limit in
> documenting that piece?
I'm replying in reverse order: - there is a SPDX format limit, but it is by design: a SPDX package entity is a single sw distribution unit, so it may have only one downloadLocation; if you have more than one downloadLocation, you must have more than one SPDX package, according to SPDX specs; - I understand that my solution is a bit hacky; but IMHO any other *post-mortem* solution would be far more hacky; the real solution would be collecting required information directly in do_fetch and do_unpack - I also understand that we should reduce pain, otherwise nobody would use our solution; the simplest and cleanest way I can think about is collecting just package (in the SPDX sense) files' relative paths and checksums at every stage (fetch, unpack, patch, package), and leave data processing (i.e. mapping upstream source packages -> recipe's WORKDIR package -> debug source package -> binary packages -> binary image) to a separate tool, that may use (just a thought) a graph database to process things more efficiently. > > Where I became puzzled is where you say "Information about debug > sources for each actual binary file is then taken from > tmp/pkgdata/<machine>/extended/*.json.zstd". This is the data we added > and use for the spdx class so you shouldn't need to reinvent that > piece. It should be the exact same data the spdx class uses. > you're right, but in the context of a POC it was easier to extract them directly from json files than from SPDX data :) It's just a POC to show that required information may be retrieved in some way, implementation details do not matter > I was also puzzled about the difference between rpm and the other > package backends. The exact same files are packaged by all the package > backends so the checksums from do_package should be fine. > Here I may miss some piece of information. I looked at files in tmp/pkgdata but I couldn't find package file checksums anywhere: that is why I parsed rpm packages. 
But if such checksums were already available somewhere in tmp/pkgdata, it wouldn't be necessary to parse rpm packages at all... Could you point me to what I'm (maybe) missing here? Thanks!

In any case, thank you so much for all your insights, they were super-useful!

Cheers,

Alberto
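Alberto's idea above - collecting just files' relative paths and checksums at every stage (fetch, unpack, patch, package) and leaving the mapping to a separate tool - could be sketched roughly like this (illustrative code only, not the POC's actual implementation):

```python
import hashlib
import os

def snapshot(directory: str) -> dict[str, str]:
    """Map each file's path (relative to `directory`) to its sha256 hex digest."""
    result = {}
    for root, _, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, directory)
            with open(path, "rb") as f:
                result[rel] = hashlib.sha256(f.read()).hexdigest()
    return result

# One snapshot per stage; a separate tool can later join the stages on
# checksum to map upstream source files to the files that actually end
# up in a binary package, e.g. stages["unpack"] = snapshot(workdir)
# taken at the end of do_unpack, and likewise for do_patch, do_package.
stages: dict[str, dict[str, str]] = {}
```

Joining snapshots on checksum is what would allow mapping a file in "the source" back to the upstream archive it came from without re-parsing anything after the build.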
* Re: [Openembedded-architecture] Adding more information to the SBOM
  2022-09-16 15:18 ` Alberto Pianon
@ 2022-09-16 15:49 ` Mark Hatle
  2022-09-20 12:25 ` Alberto Pianon
  2022-09-16 16:08 ` Richard Purdie
  1 sibling, 1 reply; 11+ messages in thread
From: Mark Hatle @ 2022-09-16 15:49 UTC (permalink / raw)
To: Alberto Pianon, Richard Purdie
Cc: Marta Rybczynska, OE-core, openembedded-architecture, Joshua Watt, 'Carlo Piana', davide.ricci

On 9/16/22 10:18 AM, Alberto Pianon wrote:

... trimmed ...

>> I also can see the issue with multiple sources in SRC_URI, although you
>> should be able to map those back if you assume subtrees are "owned" by
>> given SRC_URI entries. I suspect there may be a SPDX format limit in
>> documenting that piece?
>
> I'm replying in reverse order:
>
> - there is a SPDX format limit, but it is by design: a SPDX package
> entity is a single sw distribution unit, so it may have only one
> downloadLocation; if you have more than one downloadLocation, you must
> have more than one SPDX package, according to SPDX specs;

I think my interpretation of this is different. I've got a view of 'sourcing materials', and then verifying they are what we think they are and can be used the way we want. The "upstream sources" (and patches) are really just 'raw materials' that we use the Yocto Project to combine to create "the source".

So for the purpose of the SPDX, each upstream source _may_ have a corresponding SPDX, but for the binaries their source is the combined unit.. not multiple SPDXes. Think of it something like:

upstream source1 - SPDX
upstream source2 - SPDX
upstream patch
recipe patch1
recipe patch2

In the above, each of those items would be combined by the recipe system to construct the source used to build an individual recipe (and collection of packages). Automation _IS_ used to combine the components [unpack/fetch] and _MAY_ be used to generate a combined SPDX.

So your "upstream" location for this recipe is the local machine's source archive.
The SPDX for the local recipe files can merge the SPDX information they know, and (if it's at a file level) can use checksums to identify the items not captured/modified by the patches for further review (either manual, or automation like Fossology). In the case where an upstream has SPDX data, you should be able to inherit MOST files this way... but the output is specific to your configuration and patches.

1 - SPDX |
2 - SPDX |
patch    |---> recipe specific SPDX
patch    |
patch    |

In some cases someone may want to generate SPDX data for the 3 patches, but that may or may not be useful in this context.

> - I understand that my solution is a bit hacky; but IMHO any other
> *post-mortem* solution would be far more hacky; the real solution
> would be collecting required information directly in do_fetch and
> do_unpack

I've not looked at the current SPDX spec, but past versions had a notes section. Assuming this is still present, you can use it to reference back to how this component was constructed and the upstream source URIs (and SPDX files) you used for processing.

This way nothing really changes in do_fetch or do_unpack. (You may want to find a way to capture file checksums and what the source was for a particular file.. but it may not really be necessary!)

> - I also understand that we should reduce pain, otherwise nobody would
> use our solution; the simplest and cleanest way I can think about is
> collecting just package (in the SPDX sense) files' relative paths and
> checksums at every stage (fetch, unpack, patch, package), and leave
> data processing (i.e. mapping upstream source packages -> recipe's
> WORKDIR package -> debug source package -> binary packages -> binary
> image) to a separate tool, that may use (just a thought) a graph
> database to process things more efficiently.

Even in do_patch nothing really changes, other than that, again, you may want to capture checksums to identify things that need further processing.
This approach greatly simplifies things, and gives people doing code reviews the insight into what is the source used when shipping the binaries (which is really an important aspect of this), as well as which recipe and "build" (really fetch/unpack/patch) were used to construct the sources. If they want to investigate the sources further back to their provider, then the notes would have the information for that, and you could transition back to the "raw materials" providers. >> >> Where I became puzzled is where you say "Information about debug >> sources for each actual binary file is then taken from >> tmp/pkgdata/<machine>/extended/*.json.zstd". This is the data we added >> and use for the spdx class so you shouldn't need to reinvent that >> piece. It should be the exact same data the spdx class uses. >> > > you're right, but in the context of a POC it was easier to extract them > directly from json files than from SPDX data :) It's just a POC to show > that required information may be retrieved in some way, implementation > details do not matter > >> I was also puzzled about the difference between rpm and the other >> package backends. The exact same files are packaged by all the package >> backends so the checksums from do_package should be fine. >> > > Here I may miss some piece of information. I looked at files in > tmp/pkgdata but I couldn't find package file checksums anywhere: that is > why I parsed rpm packages. But if such checksums were already available > somewhere in tmp/pkgdata, it wouldn't be necessary to parse rpm packages > at all... Could you point me to what I'm (maybe) missing here? Thanks! file checksumming is expensive. There are checksums available to individual packaging engines, as well as aggregate checksums for "hash equivalency".. but I'm not aware of any per-file checksum that is stored. You definitely shouldn't be parsing packages of any type (rpm or otherwise), as packages are truly optional. It's the binaries that matter here. 
--Mark

> In any case, thank you much so for all your insights, they were
> super-useful!
>
> Cheers,
>
> Alberto
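Mark's checksum triage - inherit upstream SPDX data for files untouched by patches, flag the rest for review - could be sketched like this (an illustrative sketch; the per-file checksum maps are assumed to have been captured before and after do_patch):

```python
def triage(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """Split post-patch files into those whose upstream SPDX data can be
    inherited (checksum unchanged) and those needing fresh review
    (modified or added by a patch)."""
    inherit, review = [], []
    for path, digest in after.items():
        if before.get(path) == digest:
            inherit.append(path)   # untouched by patches
        else:
            review.append(path)    # modified or added by a patch
    return {"inherit": sorted(inherit), "review": sorted(review)}

# Hypothetical per-file sha256 maps from before and after do_patch:
before = {"main.c": "aaa", "util.c": "bbb"}
after = {"main.c": "aaa", "util.c": "ccc", "fix.c": "ddd"}
print(triage(before, after))
# {'inherit': ['main.c'], 'review': ['fix.c', 'util.c']}
```

Only the "review" bucket then needs manual work or a scanner pass, which is the scaling argument in Mark's message.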
* Re: [Openembedded-architecture] Adding more information to the SBOM 2022-09-16 15:49 ` Mark Hatle @ 2022-09-20 12:25 ` Alberto Pianon 0 siblings, 0 replies; 11+ messages in thread From: Alberto Pianon @ 2022-09-20 12:25 UTC (permalink / raw) To: Mark Hatle Cc: Richard Purdie, Marta Rybczynska, OE-core, openembedded-architecture, Joshua Watt, 'Carlo Piana', davide.ricci Il 2022-09-16 17:49 Mark Hatle wrote: > On 9/16/22 10:18 AM, Alberto Pianon wrote: > > ... trimmed ... > >>> I also can see the issue with multiple sources in SRC_URI, although >>> you >>> should be able to map those back if you assume subtrees are "owned" >>> by >>> given SRC_URI entries. I suspect there may be a SPDX format limit in >>> documenting that piece? >> >> I'm replying in reverse order: >> >> - there is a SPDX format limit, but it is by design: a SPDX package >> entity is a single sw distribution unit, so it may have only one >> downloadLocation; if you have more than one downloadLocation, you >> must >> have more than one SPDX package, according to SPDX specs; > > I think my interpretation of this is different. I've got a view of > 'sourcing materials', and then verifying the are what we think they > are and can be used the way we want. The "upstream sources" (and > patches) are really just 'raw materials' that we use the Yocto Project > to combined to create "the source". > > So for the purpose of the SPDX, each upstream source _may_ have a > corresponding SPDX, but for the binaries their source is the combined > unit.. not multiple SPDXes. Think of it something like: > > upstream source1 - SPDX > upstream source2 - SPDX > upstream patch > recipe patch1 > recipe patch2 > > In the above, each of those items would be combined by the recipe > system to construct the source used to build an individual recipe (and > collection of packages). Automation _IS_ used to combine the > components [unpack/fetch] and _MAY_ be used to generated a combined > SPDX. 
> So your "upstream" location for this recipe is the local machine's
> source archive. The SPDX for the local recipe files can merge the
> SPDX information they know (and if it's at a file level) can use
> checksums to identify the items not captured/modified by the patches
> for further review (either manual or automation like fossology). In
> the case where an upstream has SPDX data, you should be able to
> inherit MOST files this way... but the output is specific to your
> configuration and patches.
>
> 1 - SPDX |
> 2 - SPDX |
> patch    |---> recipe specific SPDX
> patch    |
> patch    |
>
> In some cases someone may want to generate SPDX data for the 3
> patches, but that may or may not be useful in this context.

IMHO it's a matter of different ways of framing Yocto recipes into SPDX format.

Upstream sources are all SPDX packages. Yocto layers are SPDX packages, too, containing some PATCH_FOR upstream packages. Upstream sources and Yocto layers are the "final" upstream sources, and each of them has its downloadLocation.

"The source" created by a recipe is another SPDX package, GENERATED_FROM upstream source packages + recipe and patches from Yocto layer package(s). "The source" may need to be distributed by downstream users (eg. to comply with *GPL-* obligations or when providing SDKs), so downstream users may make it available from their own infrastructure, "giving" it a downloadLocation.

(In SPDX, GENERATED_FROM and PATCH_FOR relationships may be between files, so one may map files found in "the source" package to individual files found in upstream source packages.)

Binary packages GENERATED_FROM "the source" are local SPDX packages, too. And firmware images are SPDX packages, too, GENERATED_FROM all the above. Firmware images are distributed by downstream users, who will provide their own downloadLocation.
>
>> - I understand that my solution is a bit hacky; but IMHO any other
>> *post-mortem* solution would be far more hacky; the real solution
>> would be collecting required information directly in do_fetch and
>> do_unpack
>
> I've not looked at the current SPDX spec, but past versions has a
> notes section. Assuming this is still present you can use it to
> reference back to how this component was constructed and the upstream
> source URIs (and SPDX files) you used for processing.
>
> This way nothing really changes in do_fetch or do_unpack. (You may
> want to find a way to capture file checksums and what the source was
> for a particular file.. but it may not really be necessary!)

If you want to automatically map all files to their corresponding upstream sources, it actually is... see my next point.

>> - I also understand that we should reduce pain, otherwise nobody would
>> use our solution; the simplest and cleanest way I can think about is
>> collecting just package (in the SPDX sense) files' relative paths and
>> checksums at every stage (fetch, unpack, patch, package), and leave
>> data processing (i.e. mapping upstream source packages -> recipe's
>> WORKDIR package -> debug source package -> binary packages -> binary
>> image) to a separate tool, that may use (just a thought) a graph
>> database to process things more efficiently.
>
> Even it do_patch nothing really changes, other then again you may want
> to capture checksums to identify thingsthat need further processing.
>
> This approach greatly simplifies things, and gives people doing code
> reviews the insight into what is the source used when shipping the
> binaries (which is really an important aspect of this), as well as
> which recipe and "build" (really fetch/unpack/patch) were used to
> construct the sources.
> If they want to investigate the sources
> further back to their provider, then the notes would have the
> information for that, and you could transition back to the "raw
> materials" providers.

The point is precisely that we would like to help people avoid doing this job, because if you scale up to n different Yocto projects it would be a time-consuming, error-prone and hardly maintainable process. Since SPDX allows representing relationships between any kind of entities (files, packages), we would like to use that feature to map local source files to upstream source files, so machines may do the job instead of people -- and people (auditors) may concentrate on reviewing upstream sources -- i.e. the atomic ingredients used across different projects or across different versions of the same project.

>>> Where I became puzzled is where you say "Information about debug
>>> sources for each actual binary file is then taken from
>>> tmp/pkgdata/<machine>/extended/*.json.zstd". This is the data we added
>>> and use for the spdx class so you shouldn't need to reinvent that
>>> piece. It should be the exact same data the spdx class uses.
>>
>> you're right, but in the context of a POC it was easier to extract them
>> directly from json files than from SPDX data :) It's just a POC to show
>> that required information may be retrieved in some way, implementation
>> details do not matter
>
>>> I was also puzzled about the difference between rpm and the other
>>> package backends. The exact same files are packaged by all the package
>>> backends so the checksums from do_package should be fine.
>>
>> Here I may miss some piece of information. I looked at files in
>> tmp/pkgdata but I couldn't find package file checksums anywhere: that is
>> why I parsed rpm packages. But if such checksums were already available
>> somewhere in tmp/pkgdata, it wouldn't be necessary to parse rpm packages
>> at all...
>> Could you point me to what I'm (maybe) missing here? Thanks!
>
> file checksumming is expensive. There are checksums available to
> individual packaging engines, as well as aggregate checksums for "hash
> equivalency".. but I'm not aware of any per-file checksum that is
> stored.
>
> You definitely shouldn't be parsing packages of any type (rpm or
> otherwise), as packages are truly optional. It's the binaries that
> matter here.

You are definitely right. I guess that it should be done (optionally) in do_package.
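Alberto's framing a few messages up - upstream sources, layers, "the source", binary packages and images, all as SPDX packages linked by relationships - can be written down as plain relationship triples (an illustrative sketch; the element names are hypothetical, while GENERATED_FROM and PATCH_FOR are relationship types from the SPDX 2.x spec):

```python
# (spdx_element, relationship, related_element)
relationships = [
    ("the-source",     "GENERATED_FROM", "upstream-source1"),
    ("the-source",     "GENERATED_FROM", "upstream-source2"),
    ("recipe-patch1",  "PATCH_FOR",      "upstream-source1"),
    ("binary-package", "GENERATED_FROM", "the-source"),
    ("firmware-image", "GENERATED_FROM", "binary-package"),
]

def provenance(element: str) -> set[str]:
    """Walk GENERATED_FROM edges transitively back to the raw materials
    (assumes the relationship graph is acyclic)."""
    out: set[str] = set()
    stack = [element]
    while stack:
        cur = stack.pop()
        for a, rel, b in relationships:
            if a == cur and rel == "GENERATED_FROM":
                out.add(b)
                stack.append(b)
    return out

print(sorted(provenance("firmware-image")))
# ['binary-package', 'the-source', 'upstream-source1', 'upstream-source2']
```

This is the mapping Alberto wants machines to do: given a shipped image, recover every upstream ingredient without a human retracing the build.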
* Re: [Openembedded-architecture] Adding more information to the SBOM 2022-09-16 15:18 ` Alberto Pianon 2022-09-16 15:49 ` Mark Hatle @ 2022-09-16 16:08 ` Richard Purdie [not found] ` <1061592967.5114533.1663597215958.JavaMail.zimbra@piana.eu> 1 sibling, 1 reply; 11+ messages in thread From: Richard Purdie @ 2022-09-16 16:08 UTC (permalink / raw) To: Alberto Pianon Cc: Marta Rybczynska, OE-core, openembedded-architecture, Joshua Watt, 'Carlo Piana', davide.ricci On Fri, 2022-09-16 at 17:18 +0200, Alberto Pianon wrote: > Il 2022-09-15 14:16 Richard Purdie wrote: > > > > For the source issues above it basically it comes down to how much > > "pain" we want to push onto all users for the sake of adding in this > > data. Unfortunately it is data which many won't need or use and > > different legal departments do have different requirements. > > We didn't paint the overall picture sufficiently well, therefore our > requirements may come across as coming from a particularly pedantic > legal department; my fault :) > > Oniro is not "yet another commercial Yocto project", we are not a legal > department (even if we are experienced FLOSS lawyers and auditors, the > most prominent of whom is Carlo Piana -- cc'ed -- former general counsel > of FSFE and member of OSI Board). > > Our rather ambitious goal is not limited to Oniro, and consists in doing > compliance in the open source way and both setting an example and > providing guidance and material for others to benefit from our effort. > Our work will therefore be shared (and possibly improved by others) not > only with Oniro-based projects but also with any Yocto project. Among > other things, the most relevant bit of work that we want to share is > **fully reviewed license information** and other legal metadata about a > whole bunch of open source components commonly used in Yocto projects. I certainly love the goal. I presume you're going to share your review criteria somehow? 
There must be some further set of steps, documentation and results beyond what we're discussing here? I think the challenge will be whether you can publish that review with sufficient "proof" that other legal departments can leverage it. I wouldn't underestimate how different the requirements and process can be between different people/teams/companies. > To do that in a **scalable and fully automated way**, we need that Yocto > collects some information that is currently disposed of (or simply not > collected) at build time. > > Oniro Project Leader, Davide Ricci - cc'ed - strongly encouraged us to > seek for feedback from you in order to find out the best way to do it. > > Maybe organizing a call would be more convenient than discussing > background and requirements here, if you (and others) are available. I don't mind having a call but the discussion in this current form may have an important element we shouldn't overlook, which is that it isn't just me you need to convince on some of this. If, for example, we should radically change the unpack/patch process, we need to have a good explanation for why people need to take that build time/space/resource hit. If we conclude that on a call, the case to the wider community would still have to be made. > > Experience > > with archiver.bbclass shows that multiple codepaths doing these things > > is a nightmare to keep working, particularly for corner cases which do > > interesting things with the code (externalsrc, gcc shared workdir, the > > kernel and more). > > > > I had a look at this and was a bit puzzled by some of it. > > > > I can see the issues you'd have if you want to separate the unpatched > > source from the patches and know which files had patches applied as > > that is hard to track. There would be significiant overhead in trying > > to process and store that information in the unpack/patch steps and the > > archiver class does some of that already. It is messy, hard and doens't > > perform well. 
> > I'm reluctant to force everyone to do it as a result but
> > that can also result in multiple code paths and when you have that, the
> > result is that one breaks :(.
> >
> > I also can see the issue with multiple sources in SRC_URI, although you
> > should be able to map those back if you assume subtrees are "owned" by
> > given SRC_URI entries. I suspect there may be a SPDX format limit in
> > documenting that piece?
>
> I'm replying in reverse order:
>
> - there is a SPDX format limit, but it is by design: a SPDX package
> entity is a single sw distribution unit, so it may have only one
> downloadLocation; if you have more than one downloadLocation, you must
> have more than one SPDX package, according to SPDX specs;

I think we may need to talk to the SPDX people about that, as I'm not convinced it always holds that you can divide software into such units. Certainly you can construct a situation where there are two repositories, each containing a source file, which are only ever linked together as one binary.

> - I understand that my solution is a bit hacky; but IMHO any other
> *post-mortem* solution would be far more hacky; the real solution
> would be collecting required information directly in do_fetch and
> do_unpack

Agreed, this needs to be done at unpack/patch time. Don't underestimate the impact of this on general users though, as many won't appreciate slowing down their builds generating this information :/.

There is also a pile of information some legal departments want which you've not mentioned here, such as build scripts and configuration information. Some previous discussions with other parts of the wider open source community rejected the Yocto Project's efforts as insufficient since we didn't mandate and capture all of this too (the archiver could optionally do some of it iirc). Is this just the first step and we're going to continue dumping more data? Or is this sufficient and all any legal department should need?
> - I also understand that we should reduce pain, otherwise nobody would
> use our solution; the simplest and cleanest way I can think about is
> collecting just package (in the SPDX sense) files' relative paths and
> checksums at every stage (fetch, unpack, patch, package), and leave
> data processing (i.e. mapping upstream source packages -> recipe's
> WORKDIR package -> debug source package -> binary packages -> binary
> image) to a separate tool, that may use (just a thought) a graph
> database to process things more efficiently.

I'd suggest stepping back and working out whether the SPDX requirement of a "single download location", which some of this stems from, really makes sense.

> > Where I became puzzled is where you say "Information about debug
> > sources for each actual binary file is then taken from
> > tmp/pkgdata/<machine>/extended/*.json.zstd". This is the data we added
> > and use for the spdx class so you shouldn't need to reinvent that
> > piece. It should be the exact same data the spdx class uses.
>
> you're right, but in the context of a POC it was easier to extract them
> directly from json files than from SPDX data :) It's just a POC to show
> that required information may be retrieved in some way, implementation
> details do not matter

Fair enough, I just want to be clear we don't want to duplicate this.

> > I was also puzzled about the difference between rpm and the other
> > package backends. The exact same files are packaged by all the package
> > backends so the checksums from do_package should be fine.
>
> Here I may miss some piece of information. I looked at files in
> tmp/pkgdata but I couldn't find package file checksums anywhere: that is
> why I parsed rpm packages. But if such checksums were already available
> somewhere in tmp/pkgdata, it wouldn't be necessary to parse rpm packages
> at all... Could you point me to what I'm (maybe) missing here? Thanks!
In some ways this is quite simple: it is because at do_package time the output packages don't exist, only their content. The final output packages are generated in do_package_write_{ipk|deb|rpm}.

You'd probably have to add a stage to the package_write tasks which wrote out more checksum data, since the checksums are only known at the end of those tasks. I would question whether adding this additional checksum into the SPDX output actually helps much in the real world though. I guess it means you could look an RPM up against its checksum, but is that something people need to do?

Cheers,

Richard
[parent not found: <1061592967.5114533.1663597215958.JavaMail.zimbra@piana.eu>]
* Re: [Openembedded-architecture] Adding more information to the SBOM [not found] ` <1061592967.5114533.1663597215958.JavaMail.zimbra@piana.eu> @ 2022-09-20 13:15 ` Richard Purdie 0 siblings, 0 replies; 11+ messages in thread From: Richard Purdie @ 2022-09-20 13:15 UTC (permalink / raw) To: Carlo Piana Cc: Alberto Pianon, Marta Rybczynska, OE-core, openembedded-architecture, Joshua Watt, davide ricci On Mon, 2022-09-19 at 16:20 +0200, Carlo Piana wrote: > thank you for a well detailed and sensible answer. I certainly cannot > speak on technical issues, although I can understand there are > activities which could seriously impact the overall process and need > to be minimized. > > > > On Fri, 2022-09-16 at 17:18 +0200, Alberto Pianon wrote: > > > Il 2022-09-15 14:16 Richard Purdie wrote: > > > > > > > > For the source issues above it basically it comes down to how much > > > > "pain" we want to push onto all users for the sake of adding in this > > > > data. Unfortunately it is data which many won't need or use and > > > > different legal departments do have different requirements. > > > > > > We didn't paint the overall picture sufficiently well, therefore our > > > requirements may come across as coming from a particularly pedantic > > > legal department; my fault :) > > > > > > Oniro is not "yet another commercial Yocto project", we are not a legal > > > department (even if we are experienced FLOSS lawyers and auditors, the > > > most prominent of whom is Carlo Piana -- cc'ed -- former general counsel > > > of FSFE and member of OSI Board). > > > > > > Our rather ambitious goal is not limited to Oniro, and consists in doing > > > compliance in the open source way and both setting an example and > > > providing guidance and material for others to benefit from our effort. > > > Our work will therefore be shared (and possibly improved by others) not > > > only with Oniro-based projects but also with any Yocto project. 
> > > Among other things, the most relevant bit of work that we want to
> > > share is **fully reviewed license information** and other legal
> > > metadata about a whole bunch of open source components commonly
> > > used in Yocto projects.
> >
> > I certainly love the goal. I presume you're going to share your review
> > criteria somehow? There must be some further set of steps,
> > documentation and results beyond what we're discussing here?
>
> Our mandate (and our own attitude) is precisely to make everything as
> public as possible.
>
> We have already published about it:
> https://gitlab.eclipse.org/eclipse/oniro-compliancetoolchain/toolchain/docs/-/tree/main/audit_workflow
>
> The entire review process is made using GitLab's issues and will be
> made public.

I need to read into the details, but that looks like a great start and I'm happy to see the process being documented! Thanks for the link, I'll try and have a read.

> We have only one reservation concerning sensitive material, in case we
> find something legally problematic (to comply with attorney/client
> privilege) or security-wise critical (in which case we adopt a
> responsible disclosure principle and embargo some details).

That makes sense, it is a tricky balancing act at times.

> > I think the challenge will be whether you can publish that review with
> > sufficient "proof" that other legal departments can leverage it. I
> > wouldn't underestimate how different the requirements and process can
> > be between different people/teams/companies.
>
> Speaking from a legal perspective, this is precisely the point. It is
> true that we want to create a curated database of decisions, which like
> any human enterprise is prone to errors and correction, and therefore
> we cannot have the last word.
> However, IF we can at least point to a unique artifact and give its
> exact hash, there will be no need to trust us; that would be open to
> inspection, because everybody else could look at the same source we
> have identified and make sure we have extracted all the information.

I do love the idea and I think it is quite possible. I do think this does lead to one of the key details we need to think about though.

From a legal perspective I'd imagine you like dealing with a set of files that make up the source of some piece of software. I'm not going to use the word "package" since I think the term is overloaded and confusing. That set of files can all be identified by checksums. This pushes us towards wanting checksums of every file.

Stepping over to the build world, we have bitbake's fetcher and it actually requires something similar - any given "input" must be uniquely identifiable from the SRC_URI and possibly a set of SRCREVs. Why? We firstly need to embed this information into the task signature. If it changes, we know we need to rerun the fetch and re-obtain the data. We work on inputs to generate this hash, not outputs, and we require all fetcher modules to be able to identify sources like this. In the case of a git repo, the hash of a git commit is good enough. For a tarball, it would be a checksum of the tarball. Where there are local patch files, we include the hashes of those files. The bottom line is that we already have a hash which represents the task inputs.

Bugs happen, sure. There are also poor fetchers; npm and go present challenges in particular, but we've tried to work around those issues. What you're saying is that you don't trust what bitbake does, so you want the next level of information, about the individual files. In theory we could put the SRC_URI and SRCREVs into the SPDX as the source (which could be summarised into a task hash) rather than the upstream url. It all depends which level you want to break things down to.
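[Editor's note: the input-hash idea described above — hashing the fetcher's *inputs* (SRC_URI entries, SRCREVs, local file checksums) rather than its outputs — can be illustrated with a small sketch. This is not bitbake's actual signature algorithm; the function name and the way inputs are combined are purely illustrative assumptions.]

```python
import hashlib

def fetch_input_hash(src_uri, srcrevs=None, local_file_hashes=None):
    """Combine the identifying inputs of a hypothetical fetch task into
    one digest. If any input changes (a url, an scm revision, a local
    patch file's checksum), the digest changes, signalling that the
    fetch must be rerun. Illustrative only; not bitbake's real code."""
    h = hashlib.sha256()
    for uri in sorted(src_uri):                      # SRC_URI entries
        h.update(uri.encode())
    for name, rev in sorted((srcrevs or {}).items()):  # SRCREVs for scm urls
        h.update(f"{name}={rev}".encode())
    for fname, csum in sorted((local_file_hashes or {}).items()):
        h.update(f"{fname}:{csum}".encode())         # local patch files
    return h.hexdigest()
```

The same inputs always yield the same hash, and changing any SRCREV yields a different one, which is the property that lets a single input hash stand in for the whole set of fetched sources.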
I do see a case for needing the lower level info, as in review you are going to want to know the delta from the last review decisions. You also prefer having a different "upstream" url form for some kinds of checks, like CVEs. It does feel a lot like we're trying to duplicate information and cause significant growth of the SPDX files without an actual definitive need though. You could equally put in a mapping between a fetch task checksum and the checksums of all the files that fetch task would expand to if run (it should always do it deterministically).

> To be clearer, we are not discussing here the obligation to provide
> the entire corresponding source code as with *GPLv3, but rather we
> are seeking to establish the *provenance* of the software, of all
> bits (also in order to see what patch has been applied by whom and to
> close which vulnerability, in case).

My worry is that by not considering the obligation, we don't cater for a portion of the userbase, and by doing so, we limit the possible adoption.

> Provenance also has a great impact on "reproducibility" of legal work
> on sources. If we are not able to tell what has gone into our package
> from where (and this may prove hard and require a lot of manual - and
> therefore error-prone - work, especially in the case of complex Yocto
> recipes using e.g. crate/cargo or npm(sw) fetchers), we (lawyers and
> compliance specialists) are at a great disadvantage proving we have
> covered all our bases.

I understand this more than you realise, as we have the same problem in the bitbake fetcher and have spent a lot of time trying to solve it. I won't claim we're there for some of the modern runtimes, and I'd love help, both in explaining to the upstream projects why we need this and in technically fixing the fetchers so these modern runtimes work better.
> This is a very good point, and I can vouch that this is really
> important, but maybe you are reading too much in here: at this stage,
> our goal is not to convince anyone to radically change Yocto tasks to
> meet our requirements, but to share such requirements and their
> rationale, collect your feedback and possibly adjust them, and also
> to figure out the least impactful solution to meet them (possibly
> without radical changes but just by adding optional functions to
> existing tasks).

"Optional functions" fill me with dread; this is the archiver problem I mentioned. One of the things I try really hard to do is to have one good way of doing things rather than multiple options with different levels of functionality. If you give people choices, they use them. When someone's build fails, I don't want to have to ask "which fetcher were you using? Did you configure X or Y or Z?". If we can all use the same code and codepaths, it means we see bugs, we see regressions, and we have a common experience without the need for complex test matrices.

Worst case, you can add optional functions, but I kind of see that as a failure. If we can find something with low overhead which we can all use, that would be much better. Whether it is possible, I don't know, but it is why we're having the discussion. This is why I have a preference for trying to keep common code paths for the core.

> > > - I understand that my solution is a bit hacky; but IMHO any other
> > > *post-mortem* solution would be far more hacky; the real solution
> > > would be collecting required information directly in do_fetch and
> > > do_unpack
> >
> > Agreed, this needs to be done at unpack/patch time. Don't underestimate
> > the impact of this on general users though, as many won't appreciate
> > slowing down their builds generating this information :/.
>
> Can't this be made optional, so one could just go for the "old" way
> without impacting much? Sorry if I'm stepping in where I'm naive.
See above :).

> > There is also a pile of information some legal departments want which
> > you've not mentioned here, such as build scripts and configuration
> > information. Some previous discussions with other parts of the wider
> > open source community rejected the Yocto Project's efforts as
> > insufficient since we didn't mandate and capture all of this too (the
> > archiver could optionally do some of it, iirc). Is this just the first
> > step, and we're going to continue dumping more data? Or is this
> > sufficient and all any legal department should need?
>
> I think that trying to give all legal departments what they want
> would prove impossible. I think the idea here is more to start
> building a collectively managed database of provenance and licensing
> data, with a curated set of decisions for as many packages as
> possible. This way everybody can have some good clue -- and
> increasingly a better one -- as to which license(s) apply to which
> package, removing much of the guesswork that is required today.

It makes sense and is a worthy goal. I just wish we could key this off bitbake's fetch task checksum rather than having to dump reams of file checksums!

> We ourselves reuse a lot of information coming from Debian's machine-
> readable information, sometimes finding mistakes and opening issues
> upstream. That helped us cut down the license information harvesting
> and review effort by a great deal.

This does explain why the bitbake fetch mechanism would be a struggle for you though, as you don't want to use our fetch units as your base component (which is why we end up struggling with some of the issues).

In the interests of moving towards a conclusion, I think what we'll end up needing to do is generate more information from the fetch and patch tasks, perhaps with a json file summary of what they do (filenames and checksums?).
That would give your tools the data they need, even if I'm not convinced we should be dumping more and more data into the final SPDX files.

Cheers,

Richard

^ permalink raw reply [flat|nested] 11+ messages in thread
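[Editor's note: the "json file summary of what they do (filenames and checksums?)" suggested above is only a proposal in this thread, but what such a manifest could look like is easy to sketch. The function name, manifest filename, and json layout below are assumptions for illustration, not an agreed format; in practice this would run at the end of do_unpack or do_patch over the recipe's work directory.]

```python
import hashlib
import json
import os

def write_unpack_manifest(workdir, manifest_path):
    """Record every unpacked file's relative path and sha256 in a json
    manifest. A hypothetical sketch of the per-task summary discussed
    in the thread: enough for an external tool to map files back to
    their fetch stage, without bloating the SPDX output itself."""
    entries = {}
    for root, _, files in os.walk(workdir):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            entries[os.path.relpath(path, workdir)] = digest
    with open(manifest_path, "w") as out:
        json.dump(entries, out, indent=2, sort_keys=True)
    return entries
```

A separate analysis tool (like the srctracker POC) could then join these manifests across fetch, unpack, patch, and package stages, keeping the heavy per-file data out of the final SPDX documents.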
end of thread, other threads:[~2022-09-20 13:15 UTC | newest] Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-09-14 14:16 Adding more information to the SBOM Marta Rybczynska 2022-09-14 14:56 ` Joshua Watt 2022-09-14 17:10 ` [OE-core] " Alberto Pianon 2022-09-14 20:52 ` Joshua Watt 2022-09-15 1:16 ` [Openembedded-architecture] " Mark Hatle 2022-09-15 12:16 ` Richard Purdie 2022-09-16 15:18 ` Alberto Pianon 2022-09-16 15:49 ` Mark Hatle 2022-09-20 12:25 ` Alberto Pianon 2022-09-16 16:08 ` Richard Purdie [not found] ` <1061592967.5114533.1663597215958.JavaMail.zimbra@piana.eu> 2022-09-20 13:15 ` Richard Purdie