[spec: webpack 5] - `target.webpackGraph` - Persistent graph target
Current Problems & Scenarios
To get the most performant web app, users currently have to leverage Rollup to package libs and have webpack consume those libs: when webpack bundles, it adds module wrappers and runtimes for running in the browser or another supported target (e.g. node, electron, etc.). Authors can choose to create UMD or CommonJS output; however, when this format is consumed by webpack (for another application), it adds un-removable overhead to build sizes. Therefore we have recommended in the past that developers who are using webpack, but are also creating component libraries that webpack consumes, leverage tooling like RollupJS so that the bundling is as efficient as possible. This not only causes unnecessary work, but also imposes mental cost and time on our users to leverage separate tooling.
Build overhead, not just size: On top of that, any time webpack consumes a webpack bundle, it has to parse and evaluate every code path that is bundled. If you build a library once, and you are not actively developing it in your application, webpack shouldn’t have to build it more than once.
Building component libraries and monorepos: When your project/application scales, both problems one and two become more prevalent, as building the app additionally incurs O(x * buildTime) for building out each library itself.
Processing non-JS assets: To complicate things, if you tried to build a `.vue` or CSS-in-JS component, you would have to set up library-specific loaders and plugins, or ship the raw source so that your app build could consume it.
Proposed Solution
New reusable caching primitives: We can solve a majority of the problems above by implementing a new persistent caching primitive. By short-circuiting the `Compilation#seal()` phase and forcing webpack to output the dependency graph (JSON? Protobuf? Flatbuf?) to disk, we could (in theory) cut half of the build time complexity from webpack.
From there, webpack will collect this webpackGraph (similar to `records.json`) and hydrate this information into the fully object-modeled dependency graph which is stored in memory during a build today. This would happen near the same time that `records.json` is processed; however, we would need the capability to merge the graphs together before the `Compilation#seal()` phase (where everything is optimized and put into chunks, chunkGroups, etc.).
[Unsure about this] In addition, loaders that are applied to said component library could (?) be referenced and leveraged from the original component’s `node_modules`, so that there is determinism between the loaders used in an application bundle and the ones used to create the library.
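The serialization format above is left as an open question (JSON? Protobuf? Flatbuf?). Purely as an illustrative sketch, and assuming a JSON encoding, a dehydrated graph entry might look something like the following (none of these field names exist in webpack today; they are hypothetical):

```json
{
  "version": 1,
  "modules": [
    {
      "identifier": "./src/Button.vue",
      "type": "javascript/auto",
      "loaders": ["vue-loader"],
      "dependencies": [
        { "type": "HarmonyImportDependency", "request": "./button.css" }
      ]
    }
  ]
}
```

The key idea is that each entry carries enough serialized information (identifier, module type, applied loaders, dependency types) for webpack to rebuild its in-memory graph objects without re-tracing the original sources.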
User Stories (that speak to the spirit of solving these problem areas)
| Priority | Story |
|---|---|
| 1 | As a library author, I can publish a “webpackGraph” file alongside another target that webpack can consume. This graph contains serialized information about modules, dependencies, and their graph relation to each other (in addition to loaders applied, module and dependency types, and EntryPoint cacheGroups). |
| 1 | As a web application user, I can consume a library’s “webpackGraph” file without any configuration (automatically resolved). In addition, when I modify parts of my application, webpack will not traverse, resolve, or parse (?) the original library sources. |
Non-Goals
- We will not focus (in this RFC) on creating an “ESM” library target output. This is something we need, but it is outside the scope of this spec.
- We will not support legacy module types, shimmed modules, or any variable injection via ProvidePlugin. We primarily want to focus on webpack’s first-class module types and key loaders.
Requirements
- Creation of graph hydration and dehydration APIs. This enables us to take the circular-referencing complex object graph and store it in a file state (JSON, protobuf, flatbuf, snapshot?). If serialization is expensive, we should explore C, C++, Rust, or WASM-compilable-friendly alternatives with a JS fallback.
- Compilation will build the graph and serialize it if a specific `options.target` [or whatever it is named] property is set. Whether the graph is serialized piece by piece or all at once should be chosen based on a combination of performance and maintainability. After `Compilation#finishModules`, webpack should exit/return as if the build was complete [aka no chunk processing, etc.].
- Compilation will collect and load the graph into memory at the same time `records` is detected; this is also loaded into an in-memory module build cache once hydrated. As webpack traces the application graph, it will first check whether instances of modules already exist with the same identifiers and omit the cost of tracing them.
- A lib/component that is using the webpackGraph target should fail if it uses AMD, CJS, ProvidePlugin(), or anything else that introduces a fully dynamic nature or extra complexities or dimensions to walking the graph.
- A webpackGraph lib that is consumed should on average have the same build time as if it were a DLL module.
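The module-reuse requirement above (checking for existing instances by identifier before re-tracing) could be sketched roughly like this. All names here are hypothetical; webpack’s actual module cache internals differ:

```javascript
// Hypothetical sketch: skip re-building modules whose identifiers
// already exist in a cache hydrated from a webpackGraph file.
function resolveModule(identifier, hydratedCache, buildModule) {
  // If the hydrated graph already contains this module, reuse the
  // instance and omit the cost of tracing/parsing it again.
  if (hydratedCache.has(identifier)) {
    return hydratedCache.get(identifier);
  }
  const module = buildModule(identifier);
  hydratedCache.set(identifier, module);
  return module;
}

// Usage: the cache would be populated while hydrating the graph.
const cache = new Map([
  ["./lib/Button.js", { identifier: "./lib/Button.js", rebuilt: false }],
]);
const fromCache = resolveModule("./lib/Button.js", cache, (id) => ({
  identifier: id,
  rebuilt: true,
}));
// fromCache.rebuilt is false: the cached instance was reused, not rebuilt.
const fresh = resolveModule("./app/index.js", cache, (id) => ({
  identifier: id,
  rebuilt: true,
}));
// fresh.rebuilt is true: app modules outside the graph are still built.
```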
Questions
- How would the same loaders with different options from lib to app differentiate their functionality? I believe they will reference the same location with different options, so is this an issue?
- When should the webpackGraph hydrate? (My feeling is that when a module resolves, if there is a graphCache in the same root path as the module being resolved, webpack would favor the cache and start tracing and hydrating that part of the graph.)
Fundamentals
0CJS
Is there a zero-config story here? What is the cost of discovering webpackGraphs imperatively while modules resolve? What’s the perf win from declaring them up front? What does this mean for users?
Speed
This ideally improves speed fundamentals.
Build Size
Does this compromise build sizes at all?
Security
Does this open areas of concern for vulnerabilities?
Success Metric
- webpack build times should significantly decrease for those consuming component libraries
- webpack build times should significantly decrease for those creating webpackGraph library targets
- Reduction in memory footprint(?).
Issue Analytics
- State:
- Created 6 years ago
- Reactions: 37
- Comments: 17 (13 by maintainers)
ESM output (flattened or not) is something that developers will require in the future. So that’s the right direction.
Also, thinking about how to break up a large monolithic build into several concurrent ones is right.
Just don’t try to do this for libraries by asking lib owners to ship webpack specific data structures.
@TheLarkInn This spec is awesome. I’ve thought of something similar along this line of thought a few times while working on hard-source-webpack-plugin. So, relying on what I’ve experienced maintaining and further developing hard-source:
How will webpack 6 work with this?
A big question this will need to answer is forward and backwards compatibility. Will webpack 6 be able to use version 5 graphs? Does it choose parts of version 5 graphs to use piecemeal?
How do library authors support webpackGraph releases when webpack 6 first comes out? Do they publish the library containing both webpack 5 and 6 graphs? Does a single graph support containing both? Does webpack 6 create webpack 5 graphs or do library authors have to use webpack-version-manager to produce both?
Externals? A library depends on lodash for example.
With an aim at performance, in regards to both build time and output size, a webpackGraph will want to include only the library’s own files. Libraries can rely on package.json to state the semver ranges of the dependencies they depend on, as they already do.
I think this would be a default behaviour for a webpackGraph target.
This affects other details for a webpackGraph. I leave those details to what they affect.
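As a sketch of the externals idea above: the library’s existing package.json already carries this information, so a graphed library could simply exclude lodash from its graph and let the consuming app resolve it through its normal dependency range (illustrative fragment):

```json
{
  "name": "my-component-lib",
  "main": "dist/index.js",
  "dependencies": {
    "lodash": "^4.17.0"
  }
}
```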
Caching Primitives
I’m going to list a set of primitives that I think will help mature this spec. Some of these may use the idea of primitives loosely, being larger caching values, but I think they’ll help.
primitives
I think dehydrating this set of objects will be easy. That’s probably because I’ve been working on this “area” of webpack for so long. 😄 Or rather, the direct dehydrating and hydrating of these types is not the hard part. It’s the surrounding API that gives the flexibility to make it easy.
The troubling recursive parts are minimal. When dehydrating these types, most of the values needed to hydrate are options to their constructors. A small list of other needed values are set by helpers in webpack. The list is easy to track down, but it would be better to make those more “visible” in webpack 5 so they are easier to maintain. Visibility will probably be something like jsdocs at minimum.
For the dependencies, most of the values needed to hydrate are the arguments passed to their constructors. (This is one of the parts that will possibly make figuring out forward and backwards compatibility difficult, as dependencies change between major webpack versions, either changing their arguments, or removing and adding whole new dependency types.)
NormalModule, I think, may illustrate another part of dehydrating: the relationship of what we dehydrate to the lifecycle. A Module goes through three primary stages, in part represented by its methods: `constructor()`, `build()`, and `source()`. For NormalModule the interesting times to dehydrate it are after these calls finish or their callbacks are called (`constructor()` isn’t really useful to dehydrate, but inside it and the super classes are a lot of the needed values listed for NormalModule). `build()` is the most useful, dehydrating the module’s blocks, dependencies, and original source (the parsed output from the loaders). `source()` is also useful; after it, the rendered ReplaceSource can be dehydrated. Maybe, like these, there are methods we could add to Dependency and Source types to illustrate their lifecycle stages (and internal values) we want to dehydrate. Again though, I think jsdocs, and updating some constructors to assign null, false, or empty strings to members they may gain later, are the point to start at.

Some values are not needed. A key example is `originModule` in `HarmonyImportDependency`. It’s set to the module that has that dependency object. Hydrating dependencies, I think, will follow how dependencies are originally created and added to a module through the parser plugins. A shared state object will be used by the hydrating methods so `HarmonyImportDependency`, for example, can refer to its module. Another harmony value like this is `HarmonyExportImportedSpecifierDependency`’s `otherStarExports`. Beyond circular references like that, any others are set by webpack’s normal process, and we can probably leave the rest of webpack’s existing machinations to do that.
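The lifecycle idea above can be sketched with a toy dependency type. This is illustrative only, not webpack’s API: a dependency dehydrates to its constructor arguments, and circular references like `originModule` are rebound from shared state during hydration rather than being serialized:

```javascript
// Illustrative sketch (hypothetical class, not a webpack type): a
// dependency dehydrates to the constructor arguments needed to
// recreate it; the circular originModule reference is omitted and
// rebound via a shared state object during hydration.
class FakeImportDependency {
  constructor(request) {
    this.request = request;
    this.originModule = null; // set by the module that owns this dependency
  }
  dehydrate() {
    // Only constructor arguments are stored; originModule is skipped.
    return { type: "FakeImportDependency", args: [this.request] };
  }
  static hydrate(data, sharedState) {
    const dep = new FakeImportDependency(...data.args);
    // Rebind the circular reference from shared hydration state.
    dep.originModule = sharedState.currentModule;
    return dep;
  }
}

const original = new FakeImportDependency("./util.js");
// Round-trip through a JSON "disk format".
const stored = JSON.parse(JSON.stringify(original.dehydrate()));
const revived = FakeImportDependency.hydrate(stored, {
  currentModule: { identifier: "./index.js" },
});
```

The same pattern would extend to types whose needed values are set after `build()` or `source()`: each lifecycle stage contributes the fields that are known at that point.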
Loaders and Options
The question that may need answering before this one is: how do we tell two different option sets apart? This may mean restricting loader options for webpackGraphs to declarative options. Or, if any loader is applied to a graphed module, the stored module will not be used and webpack will build a new module from the original source provided in the library.
webpack may want to add a performance hint if too many modules are rebuilt that are available in graphs.
Pitch loaders like style-loader won’t regularly have a real effect on this if they pitch a module with no further loaders, making them depend on a graphed module.
Traverse? Resolve? Parse?
I don’t think they’ll be parsed.
I do think they’ll be traversed and resolved. Both traverse and resolving would allow individual modules to be rebuilt, like if a user wants to super optimize an individual or set of files with a loader. Traversing will also be needed for relying on other dependencies like lodash.
If this is how it’s done, the graph may be a loose one, containing internal resolutions and modules at a high level. The connections between different modules will be hydrated by webpack’s normal behaviour.
Non-JS Assets
CSS might have a clear path. If a user is ok with the css-loader options applied to a graphed CSS module, they could use style-loader or extract-text to load the CSS into the page while style-loader and extract-text will use the graphed module and toString its output CSS as they already do.
Images or other binary assets leave a question. A basic one: url-loader or file-loader? Which should webpackGraphs use? I think the answer may be a new loader (and dependency, parser plugin, and module?) used by graph targets to emit and refer to the asset but leave the `url-` or `file-` part to downstream users in some fashion.

DLL-like performance?
This may be an unfair target, but maybe I’m nitpicking this a little too much. DLL modules are very lean in the amount of work they have to do. A graphed module, in at least production mode, will need to render a source and possibly a source map with the expected used imports and exports. A development version might be able to approach a DLL module’s build time by including a “prerendered” development source in the graph that disables used exports, thus exporting all exports with their original names.
Disk Format
I’m not sure of all the types in the format, but I have some lessons from hard-source for this. However the string version is produced, there should be no giant object stringified at one time, and any stringifying should be done right as that value is written to disk. This keeps the process from building too large a string (Node on my systems will not make a string bigger than ~130MB, and that can happen in large projects if you try to stringify a truly giant object). Leaving stringifying to right before writing an item to disk helps protect against running out of memory: once the string is written to disk, it’s no longer referred to and can be garbage collected. The string can’t be garbage collected if you iterate over an array replacing all the objects with their string versions and then write those strings one at a time to disk. The non-stringified version has smaller strings that can be repeatedly referred to, reducing the memory needed to represent the dehydrated cache.
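That advice can be sketched in a few lines: stringify each cache item immediately before writing it, rather than materializing one giant string (or an array of per-item strings) first. Names here are illustrative, not a webpack or hard-source API:

```javascript
// Sketch: serialize graph items one at a time as newline-delimited
// JSON, so each stringified form is eligible for garbage collection
// as soon as its write is handed off.
function writeGraph(items, writable) {
  for (const item of items) {
    // Stringify right at the point of writing; never hold all
    // stringified items in memory at once.
    writable.write(JSON.stringify(item) + "\n");
  }
}

// Usage with a real file stream would look like:
//   const fs = require("fs");
//   writeGraph(modules, fs.createWriteStream(".webpack-graph"));

// Demonstration with an in-memory writable stand-in:
const chunks = [];
writeGraph([{ id: "./a.js" }, { id: "./b.js" }], {
  write: (chunk) => chunks.push(chunk),
});
```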
With that in mind, the dehydrating and hydrating API will probably expect to output an object when dehydrating and accept an object as input when hydrating. Taking that object and turning it into a string or Buffer will be handled by a second step.
On other format details, I’d guess the larger output object might be some kind of tar-like or db-like format: either something tar-like, since we’re likely to read in a whole graph at times, reading one file or a smaller set of files for performance, or something db-like, seeking to and reading the needed parts of the graph at that time. Inside would be the individual modules, etc., in JSON, protobuf, or Flatbuf format, along with some table.
I think this comment ended up being about how much I was going to answer, and it became too much. 😉