
NextJS broke on large codebases!

Why was NextJS breaking on large codebases? What is next-swc? And how I debugged the issue.

Excalidraw diagram for working of next-swc bindings in Next build process

One fine Monday at work, not everything was going fine: our development environment builds started to break with this error:

thread '<unnamed>' panicked at <HOME_DIR>/.cargo/registry/src/index.crates.io-6f17d22bba15001f/swc_common-0.33.9/src/input.rs:31:9:
assertion failed: start <= end
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

This was very weird: the build had been working fine before, and we hadn't updated next or any other dependency, including swc and the swc plugins we were using, around the time it started breaking. All the newly added code was working fine on its own in development, so the problem did not seem to be on our side. So we had to deep dive into the issue itself to find out what was going on.

Investigation

The first step was to look at the file where the assertion had failed, in the swc_common crate. I didn't find anything special there, just a check that the starting position of the input is smaller than or equal to its end position. Still, I had a feeling (obviously I was not sure) that some integer overflow might be involved wherever the StringInput was being initialized.

To confirm this feeling, I added logs for the start and end variables in the code (by directly editing the local .cargo/registry 🙈) and ran the build. The start and end values kept rising, and at some point end overflowed and wrapped around to a value smaller than start. Confirmed: the problem is integer overflow!
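The wrap-around can be reproduced in isolation. The following is a minimal sketch, not swc's code: narrow and span are hypothetical helpers, and u64 stands in for usize on a 64-bit machine. It only shows how a plain as u32 cast keeps the low 32 bits, so an end position that crosses 2^32 lands below its start.

```rust
/// Hypothetical helper: narrows a 64-bit offset to u32 with a plain `as` cast.
/// The cast never panics; it silently keeps only the low 32 bits.
fn narrow(offset: u64) -> u32 {
    offset as u32
}

/// Computes (start, end) for a file of `len` bytes placed at `global_offset`,
/// narrowing both positions the same unchecked way.
fn span(global_offset: u64, len: u64) -> (u32, u32) {
    (narrow(global_offset), narrow(global_offset + len))
}

fn main() {
    let limit = u32::MAX as u64;

    // A file that ends below the u32 limit: start <= end still holds.
    let (s, e) = span(limit - 1_000, 500);
    println!("start={s} end={e} start<=end: {}", s <= e);

    // A file that straddles the limit: end wraps around to a small number.
    let (s, e) = span(limit - 100, 500);
    println!("start={s} end={e} start<=end: {}", s <= e);
}
```

In the second case start stays at 4294967195 while end wraps to 399, which is exactly the shape of values that makes an assertion like start <= end fail.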

The root cause was still unknown, though. To find it, the next step was to run the build with RUST_BACKTRACE=1 to get the complete backtrace of the error. Here is the interesting part of the trace:

....
11: <swc_common::input::StringInput>::new
at <HOME_DIR>/.cargo/registry/src/index.crates.io-6f17d22bba15001f/swc_common-0.33.24/src/input.rs:31:9
12: <swc_common::input::StringInput as core::convert::From<&swc_common::syntax_pos::SourceFile>>::from
at <HOME_DIR>/.cargo/registry/src/index.crates.io-6f17d22bba15001f/swc_common-0.33.24/src/input.rs:63:9
13: swc_ecma_parser::with_file_parser::<swc_ecma_ast::module::Script, swc_ecma_parser::parse_file_as_script::{closure#0}>
at <HOME_DIR>/.cargo/registry/src/index.crates.io-6f17d22bba15001f/swc_ecma_parser-0.143.15/src/lib.rs:471:44
14: swc_ecma_parser::parse_file_as_script
at <HOME_DIR>/.cargo/registry/src/index.crates.io-6f17d22bba15001f/swc_ecma_parser-0.143.15/src/lib.rs:497:13
15: swc_compiler_base::parse_js::{closure#0}
at <HOME_DIR>/.cargo/registry/src/index.crates.io-6f17d22bba15001f/swc_compiler_base-0.7.20/src/lib.rs:69:17
16: swc_compiler_base::parse_js
at <HOME_DIR>/.cargo/registry/src/index.crates.io-6f17d22bba15001f/swc_compiler_base-0.7.20/src/lib.rs:59:19
17: <swc::Compiler>::parse_js
at <HOME_DIR>/.cargo/registry/src/index.crates.io-6f17d22bba15001f/swc-0.273.26/src/lib.rs:423:9
18: <swc::Compiler>::minify::{closure#0}
at <HOME_DIR>/.cargo/registry/src/index.crates.io-6f17d22bba15001f/swc-0.273.26/src/lib.rs:803:26
19: <swc::Compiler>::run::<core::result::Result<swc_compiler_base::TransformOutput, anyhow::Error>, <swc::Compiler>::minify::{closure#0}>
at <HOME_DIR>/.cargo/registry/src/index.crates.io-6f17d22bba15001f/swc-0.273.26/src/lib.rs:238:9
20: <swc::Compiler>::minify
at <HOME_DIR>/.cargo/registry/src/index.crates.io-6f17d22bba15001f/swc-0.273.26/src/lib.rs:734:9
21: <next_swc_napi::minify::MinifyTask as napi::task::Task>::compute::{closure#0}::{closure#0}
at <HOME_DIR>/Code/github/next.js/packages/next-swc/crates/napi/src/minify.rs:95:21
....
27: <next_swc_napi::minify::MinifyTask as napi::task::Task>::compute
at <HOME_DIR>/Code/github/next.js/packages/next-swc/crates/napi/src/minify.rs:85:9
28: napi::async_work::execute::<next_swc_napi::minify::MinifyTask>
at <HOME_DIR>/.cargo/registry/src/index.crates.io-6f17d22bba15001f/napi-2.15.0/src/async_work.rs:100:5
....

From this we can see that the call starts from the minify function exposed as part of the napi binary of next-swc, which then calls swc's Compiler::minify and finally reaches the failing assertion. So the next step was pretty simple: add logs for start and end throughout this path and read the code to find where start ends up larger than end.

After some time into the investigation, here were the findings:

  1. There is a global static instance of the SWC Compiler, which is cloned and used for both transform and minify in next-swc (ref: Github).

  2. The SWC Compiler has a cm field of type Arc<SourceMap>, so every clone of the compiler shares the same SourceMap because of the Arc.

  3. The SourceMap has a start_pos field, from which it calculates the start position of the next SourceFile it needs to create. This means it is effectively storing the running count of every character registered in the SourceMap so far.

  4. SourceMap.start_pos may be an AtomicUsize, but its value gets stored in a BytePos as a u32 via the Pos trait implementation for BytePos.

  5. The end_pos calculated in SourceFile::new adds the file's length to start_pos and stores the result as a BytePos, i.e. a u32 (ref: Github).

  6. The Pos trait implementation for BytePos used a plain as u32 cast to convert directly from usize.

From these we can easily deduce what happened: because the Pos trait implementation for BytePos converts with a plain as u32 cast, the integer overflow goes unnoticed at that point. Only later, when swc_common's StringInput asserts on the start and end positions here, does it fail, as expected.
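The findings above can be put together in a small model. This is a sketch under stated assumptions, not swc's actual code: the types reuse the names BytePos, SourceMap and Compiler only as simplified stand-ins, new_source_file takes just a length instead of real source text, positions_are_valid is a hypothetical helper that returns the result of the check instead of panicking, and a 64-bit target is assumed.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Simplified stand-in for a u32-backed position; not the real swc definition.
#[derive(Clone, Copy, Debug, PartialEq, PartialOrd)]
struct BytePos(u32);

impl BytePos {
    // Models the unchecked `as u32` conversion from usize.
    fn from_usize(n: usize) -> Self {
        BytePos(n as u32)
    }
}

#[derive(Default)]
struct SourceMap {
    // Running total of everything registered so far, shared by all users.
    start_pos: AtomicUsize,
}

impl SourceMap {
    // Reserves `len` bytes and returns the (start, end) of the new file.
    fn new_source_file(&self, len: usize) -> (BytePos, BytePos) {
        let start = self.start_pos.fetch_add(len, Ordering::SeqCst);
        (BytePos::from_usize(start), BytePos::from_usize(start + len))
    }
}

// Cloning the compiler clones the Arc, so all clones share one SourceMap.
#[derive(Clone, Default)]
struct Compiler {
    cm: Arc<SourceMap>,
}

// Same condition as the failing assertion, returned instead of asserted.
fn positions_are_valid((start, end): (BytePos, BytePos)) -> bool {
    start <= end
}

fn main() {
    let global = Compiler::default();
    let transform = global.clone();
    let minify = global.clone();

    // The transform routine registers ~4 GB of source over the whole build.
    let first = transform.cm.new_source_file(4_000_000_000);
    println!("transform: {:?} valid={}", first, positions_are_valid(first));

    // The minify routine continues from the same counter and crosses 2^32.
    let second = minify.cm.new_source_file(500_000_000);
    println!("minify: {:?} valid={}", second, positions_are_valid(second));
}
```

The second file starts at 4000000000 but its end wraps to 205032704, so the check fails in the minify routine even though minify itself only handled 500 MB.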

Possible solutions

From the investigation we now knew the whole story of why next build was breaking on our huge codebase (and, for that matter, on any codebase whose total number of characters is on the order of 2^32 or above). The next step was to fix the issue and unblock the builds. I could think of only 2 ways this could be fixed at swc:
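To get a feel for how much code "on the order of 2^32" is, here is a rough back-of-the-envelope sketch. files_until_overflow is a hypothetical helper and the file sizes are made-up examples, not measurements from our codebase. Keep in mind that the shared counter sees the code passing through both transform and minify, so a real build may reach the limit with less source than this suggests.

```rust
/// Smallest number of equally sized files whose combined length reaches 2^32
/// bytes, the point where a u32 offset can no longer represent the total.
/// Returns None for a zero file size. (Hypothetical helper for illustration.)
fn files_until_overflow(bytes_per_file: u64) -> Option<u64> {
    if bytes_per_file == 0 {
        return None;
    }
    let limit: u64 = 1 << 32;
    // Ceiling division without relying on newer standard library helpers.
    Some((limit + bytes_per_file - 1) / bytes_per_file)
}

fn main() {
    // Example sizes only: 10 KiB, 50 KiB and 1 MiB per file.
    for size in [10 * 1024, 50 * 1024, 1024 * 1024] {
        println!("{size} bytes/file -> {:?} files", files_until_overflow(size));
    }
}
```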

  1. Increase the size of BytePos to u64.
  2. Not share SourceMap instance across both transform and minify routines.

The first option was likely a major change for SWC, because part of BytePos's total range is reserved for handling comment positions. Changing the type would mean changing all the code related to that reserved space. This may not be as big a task as I think; @kdy1dev might know better.

With the first option out of the question, that left the second. There are 2 ways it could be accomplished: either remove the Arc from swc's Compiler, or stop sharing a common global static instance of Compiler between all routines in next-swc. But with my limited context of swc's codebase, both solutions had some bad tradeoffs. Removing the Arc would not only change SWC's API, it could also cause performance issues. And not sharing the Compiler instance between the routines might impact the performance of next builds, and would also lose the sourcemap info for code processed by the other routines.
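The effect of the second option can be sketched with a toy counter. This is only an illustration under my own assumptions: OffsetCounter, peak_shared and peak_isolated are hypothetical names, and the counter models nothing of a real source map except the next free offset.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Hypothetical stand-in for a source map: only tracks the next free offset.
#[derive(Default)]
struct OffsetCounter(AtomicU64);

impl OffsetCounter {
    // Registers `len` bytes and returns the end offset of that chunk.
    fn register(&self, len: u64) -> u64 {
        self.0.fetch_add(len, Ordering::SeqCst) + len
    }
}

// Highest offset reached when every task shares one counter.
fn peak_shared(task_sizes: &[u64]) -> u64 {
    let shared = Arc::new(OffsetCounter::default());
    task_sizes
        .iter()
        .map(|&len| Arc::clone(&shared).register(len))
        .max()
        .unwrap_or(0)
}

// Highest offset reached when every task gets its own fresh counter.
fn peak_isolated(task_sizes: &[u64]) -> u64 {
    task_sizes
        .iter()
        .map(|&len| OffsetCounter::default().register(len))
        .max()
        .unwrap_or(0)
}

fn main() {
    // Example workload: a 3 GB transform pass followed by a 2 GB minify pass.
    let tasks = [3_000_000_000, 2_000_000_000];
    let limit = u32::MAX as u64;
    println!("shared peak fits in u32: {}", peak_shared(&tasks) <= limit);
    println!("isolated peak fits in u32: {}", peak_isolated(&tasks) <= limit);
}
```

With a shared counter the peak is the sum of all tasks, while with isolated counters it is only the largest single task. The price is the one mentioned above: positions from one routine mean nothing to the other, so sourcemap info cannot be carried across them.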

So instead of raising a fix myself, I decided to raise an issue and get suggestions / insights from the maintainers before making a PR. By this time I had shared my findings internally with my team. Then, on a call to build a minimal reproduction, Chinmay, Maulik and I wrote a script that generates a codebase on the order of 2^32 characters, so that maintainers can reproduce the issue easily; find the repository for it here. After this I raised the issue with my findings and the reproduction here: https://github.com/vercel/next.js/issues/65436 and https://github.com/swc-project/swc/issues/8932.

Upon raising the issue, @kdy1dev quickly fixed it here in the PR (yeah, my chance to contribute to the next repo for this ended here 😭, no complaints though). He chose to fix it by removing the single static instance of the SWC Compiler.

But what about the builds?

In between all this, to temporarily unblock the builds, we had removed the graphql-tag-swc-plugin from the next configuration, so that less code got generated for our bundle as input to the minify routine, which avoided the integer overflow.

Even though the fix was done in next's repo, it was not released yet. To keep the builds working until the change was released and we had migrated our codebase to that version, we had to find some way to patch the issue on our end. To do this, we built next-swc's napi bindings with the fix locally via docker for the linux and macos platforms, and patched the next package via yarn patch to use our bindings instead of its own. (This was not as straightforward as it may seem here, maybe a story for another blog post.)

Conclusions

Finally, the issue was fixed and we could go back to normal. Special thanks to Chinmay and Maulik from my team for letting me work on this issue. I got to learn a lot from this exercise, and it felt really good to solve it in the end 😎.

Feel free to share your thoughts, any mistakes you spotted, or how this could have been done better in my DMs. Till the next blog post, see you later!