Re: [bitbake-devel] Improving npm(sw) fetcher & integration within Bitbake

From: Stefan Herbrechtsmeier <stefan.herbrechtsmeier-oss@weidmueller.com>
To: richard.purdie@linuxfoundation.org,
	Jasper Orschulko <Jasper.Orschulko@iris-sensing.com>,
	"bitbake-devel@lists.openembedded.org"
	<bitbake-devel@lists.openembedded.org>
Cc: "martin@mko.dev" <martin@mko.dev>,
	Daniel Baumgart <Daniel.Baumgart@iris-sensing.com>
Subject: Re: [bitbake-devel] Improving npm(sw) fetcher & integration within Bitbake
Date: Fri, 5 Nov 2021 10:07:28 +0100
Message-ID: <42a0350d-991a-731d-29c2-83cd62da8c7b@weidmueller.com>
In-Reply-To: <0b63fba531fc94bbe915dfc9915c0a2f42ad3ce9.camel@linuxfoundation.org>

Hi Jasper and Richard,

On 05.11.2021 at 00:15, Richard Purdie via lists.openembedded.org wrote:
> On Thu, 2021-11-04 at 12:29 +0000, Jasper Orschulko wrote:
>> Dear Bitbake developers,
>>
>> recently we have been looking at the npmsw fetcher and discovered some
>> challenges regarding the integration into the developer workflow as
>> well as the build times within Bitbake. We believe that we found a
>> mechanism which would integrate well into Bitbake's existing project
>> structure and drastically improve the situation.
>>
>>
>> But first, what are the issues with the current npmsw fetcher?
>>
>> 1. Let's have a look at a typical npm-based project. You'd typically
>> have your package-lock.json (aka shrinkwrap file) stored within the git
>> repository containing your source code. Developers will rely on this
>> package-lock file on a daily basis during the development cycle.
>> Unfortunately, the current npmsw fetcher only supports shrinkwrap files
>> stored within the meta layer or within an npm registry. This is not
>> ideal, as changes to the file might be made within the project repo,
>> which then need to be manually applied to the lock file within the meta
>> repo. An ideal npmsw fetcher therefore would support using the lock
>> file directly from the source code repo.

The package-lock.json and npm-shrinkwrap.json files are identical in
format. The only difference is that npm-shrinkwrap.json can be
published with the package.

>> 2. The current implementation of the npm class uses multiple shellouts
>> per npm module in order to add these to the npm cache. This is done, as
>> the `npm install` command is not called within the do_fetch, but at the
>> end of the do_configure step. This drastically increases the time
>> Bitbake spends in the do_configure step for a npm based recipe. In our
>> case (we have a relatively small project with approx. 600 npm packages
>> in total, including recursive packages) this takes ~100 minutes to
>> complete. What makes things worse, every change to the recipe and/or
>> lock file will cause a complete rerun of the do_configure job.

This is caused by the sequential setup of the cache. I have a
prototype that does this in a dedicated bb task and runs multiple
parallel processes inside that task. But I also have a prototype that
removes the cache completely and speeds up the build significantly.
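
As an illustration only (this is not the prototype mentioned above), a
minimal Python sketch of filling the npm cache in parallel instead of
sequentially could look like the following. The function names, the
worker count and the injectable `runner` parameter are assumptions made
for this example; `npm cache add <tarball>` is the only npm command
used.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def add_to_cache(tarball, runner=subprocess.run):
    """Add one already-downloaded tarball to the npm cache."""
    # check=True raises CalledProcessError if npm exits with an error.
    runner(["npm", "cache", "add", tarball], check=True)
    return tarball


def populate_cache(tarballs, workers=8, runner=subprocess.run):
    """Add all tarballs to the npm cache using a pool of worker threads.

    Threads are sufficient here because each worker only waits for an
    external npm process. Results come back in input order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda t: add_to_cache(t, runner), tarballs))
```

Whether npm tolerates concurrent writes to a single cache directory
well enough for a given npm version would still have to be checked.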

>> As a result, the npm fetcher currently is not really usable for
>> production workloads.

Ack.

>> So how can we address these issues?
>>
>> We plan to implement a "sub-fetcher" for npmsw (a concept which might
>> also be recyclable for similar use-cases). This would take the
>> form of e.g.:
>>
>> SRC_URI = "npmsw+git://git-uri.git;npm-topdir=path_to_npm_project;..."
>>
>> The idea is that the npmsw fetcher would then call an arbitrary
>> sub-fetcher (in this case git, however any fetcher will be supported) and
>> after the sub-fetcher has extracted the source code into the DL_DIR,
>> the npm fetcher will create a secondary download folder as a copy of
>> the sub-fetchers download folder. Within this copy, the npm fetcher
>> will call `npm ci`, effectively downloading the npm packages by doing a
>> clean-install on the basis of the package.json and the package-
>> lock.json files within the npmsw download dir. This results in a much
>> faster build, as it removes the need for separate handling of the
>> individual node packages, as well as streamlining the developer's
>> workflow with the build process within Bitbake.

How should this support the download proxy? The npm ci command needs a
repository or a cache to work.

Furthermore, you need a patch step between the two fetch steps so that
the configuration can be tuned or fixed before the second fetch step.

>> As this fetcher would be implemented separately from the current npmsw
>> fetcher, this will not cause any breaking changes for existing setups.
>>
>> Additionally, we plan on writing a separate npmsw.bbclass, which will
>> parse the package.json for each node module for an automated Bitbake
>> license manifest generation, which will resolve the current challenge
>> of having to maintain these manually, as currently described at
>> https://www.yoctoproject.org/docs/latest/mega-manual/mega-manual.html#npm-using-the-registry-modules-method
>> .

These licenses are generated by recipetool, and you can provide
checksums to detect the correct licenses.

The license inside package.json is only a hint; you need a license
file to fulfill license compliance. Because of this, I remove
package.json from LIC_FILES_CHKSUM, as it is useless for license
compliance.
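
To illustrate the difference between the hint and the license file, a
minimal Python sketch (not what recipetool actually does) could read
the license field from package.json and compute md5 checksums only for
the real license files. The function name, the list of file name
prefixes and the in-memory `files` mapping are assumptions made for
this example.

```python
import hashlib
import json

# Assumption: common license file name prefixes; real packages vary.
LICENSE_PREFIXES = ("LICENSE", "LICENCE", "COPYING")


def license_info(package_json_text, files):
    """Return (license hint, LIC_FILES_CHKSUM-style entries).

    `files` maps the names of a package's top-level files to their
    content as bytes. Only license files get a checksum entry; the
    package.json hint is returned separately and is never checksummed.
    """
    hint = json.loads(package_json_text).get("license")
    entries = [
        f"file://{name};md5={hashlib.md5(data).hexdigest()}"
        for name, data in sorted(files.items())
        if name.upper().startswith(LICENSE_PREFIXES)
    ]
    return hint, entries
```

A package with a hint but an empty entry list is exactly the case that
needs manual attention for license compliance.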

>> If this is something you see as a worthwhile goal, we will provide a
>> set of patch files within the coming weeks.

I think you are mixing up the unusable npm implementation with your
special use case.

The problem is that the current npm implementation isn't really usable.
I'm working on this and already have a prototype that can install,
build and *test* a proprietary Angular project and Node-RED as well as
koa/examples from GitHub.

If I understand you correctly, you would like to build an npm recipe
that can change its dependencies without updating the recipe, except
for the SRCREV of the repositories.

> At a first read it sounds reasonable but I don't know the answers to a few
> questions which make or break things from an OE/bitbake perspective. Those
> questions are:
> 
> a) Once DL_DIR has been populated by this fetch mechanism, can a subsequent
> build run with just the data from there without accessing the network?
> 
> b) Is the information encoded into SRC_URI enough to give a deterministic build
> result, i.e. if we run this build at some later date, will we get the same
> result?
> 
> c) Is fetching only happening during the do_fetch task and not in any subsequent
> step?
> 
> 
> I'd love for some of the other people who've worked on this code to jump in as I
> don't use it or understand it in detail. I am worried about how we maintain this
> longer term as different people seem to have different use cases which sees the
> code changing in different directions and we're starting to look like we may end
> up with multiple ways of doing things which I really dislike.

This leads to the question of what the desired way to integrate a
package / dependency manager is. Nowadays every language (even C/C++)
has a package manager available, and more and more build systems (e.g.
Meson, CMake) support automatic download of dependencies. The common
integration into OE is a script (recipetool) that generates a recipe
from the foreign configuration. The current npm implementation is
special because it reuses a foreign configuration and translates it
into fetch commands on the fly. This leads to the problem that common
tweaks, like overriding a dependency or sharing configuration between
recipes via an include file, aren't possible. We could fix this by
removing the foreign configuration and doing the translation during
recipe creation. But this means you have to recreate the recipe after
every dependency change.
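
As a rough sketch of what "translation during recipe creation" could
mean (not the existing recipetool or fetcher code), the following
Python function turns a lock file into a flat list of download
entries that a script could then write into a recipe. It assumes a
lock file with a top-level "packages" map (lockfileVersion 2 or
later); the function name and the keys of the returned dictionaries
are made up for this example.

```python
import json


def lock_to_entries(lock_text):
    """Translate package-lock.json text into a list of download entries."""
    lock = json.loads(lock_text)
    entries = []
    for path, meta in sorted(lock.get("packages", {}).items()):
        # Skip the root project ("" key) and entries without a download
        # URL, such as links to local directories.
        if not path or "resolved" not in meta:
            continue
        entries.append({
            # Package name is the part after the last "node_modules/".
            "name": path.rsplit("node_modules/", 1)[-1],
            "version": meta.get("version"),
            "url": meta["resolved"],
            "integrity": meta.get("integrity"),
            # Where the package is unpacked relative to the project root.
            "destdir": path,
        })
    return entries
```

Once the entries are part of the recipe, a single dependency can be
overridden or shared via an include file like any other variable, at
the cost of regenerating the recipe when the lock file changes.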

Is it a valid use case for OE to support foreign dependency 
configurations like npm-shrinkwrap.json, go.sum or conan.lock?

Regards
   Stefan