NextJS broke on large codebases!
Why NextJS was breaking on large codebases, what next-swc is, and how I debugged the issue.
One fine Monday at work, not everything was going fine: our development environment builds started breaking with an error:
This was very weird. The build had been working fine before, and we hadn't updated next or any other dependency, including swc and the swc plugins we were using, between the last good build and the first broken one. All the newly added code worked fine on its own in development, so the problem did not seem to be on our side. We had to dive deep into the issue itself to find out what was going on.
Investigation
The first step was to look at the file where the assertion had failed, in the swc_common crate. I didn't find anything special there, just an assertion checking that the start position of the input is smaller than or equal to its end position. Still, I had a feeling (I obviously wasn't sure) that an integer overflow might be involved wherever the StringInput is being initialized.
To confirm this feeling, I added logs for the start and end variables in the code (by directly editing the local .cargo/registry 🙈) and ran the build. The start and end values kept rising, and at some point end overflowed and wrapped around to a value smaller than start. That confirmed it: the problem was an integer overflow!
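To make the wraparound concrete, here is a minimal Rust sketch. It is not swc's code: span_for is a made-up helper and the numbers are chosen only for illustration. It assumes a running byte offset that keeps growing and is then squeezed into 32-bit positions.

```rust
/// Hypothetical helper (not part of swc): returns the (start, end) positions
/// a file would get if a running byte offset were stored in 32-bit positions.
fn span_for(running_offset: u64, file_len: u64) -> (u32, u32) {
    // `as u32` keeps only the low 32 bits, so large values wrap silently.
    let start = running_offset as u32;
    let end = (running_offset + file_len) as u32;
    (start, end)
}

fn main() {
    // Well below 2^32: positions behave as expected.
    let (start, end) = span_for(1_000, 500);
    assert!(start <= end);

    // The file straddles 2^32: `end` wraps around and lands below `start`.
    let (start, end) = span_for(u32::MAX as u64 - 100, 500);
    println!("start = {}, end = {}", start, end);
    // A `start <= end` assertion like the one in swc_common would fail here.
    assert!(start > end);
}
```

The second call is the interesting one: the file begins just under 2^32 and ends just over it, so only the end position wraps.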
The root cause was still unknown, though, so the next step was to run the build with RUST_BACKTRACE=1 to get the complete backtrace of the error. Here is the interesting part of the trace:
From this we can see that the call starts from the minify function exposed as part of the napi binary of next-swc, then swc's Compiler::minify is called, and finally we reach the failing assertion. So the next step was pretty simple: add logs for start and end throughout this path and read the code to find where start was being assigned a value larger than end.
After some time into the investigation, here were the findings:
- There is a global static instance of the SWC Compiler, which is cloned and used for transform and minify in next-swc (ref: Github).
- The SWC Compiler has a cm field, which is an Arc<SourceMap>, so the same source map is shared among every clone of the compiler because of the Arc.
- The SourceMap has a start_pos field, which it uses to calculate the start_pos of the next SourceFile it needs to create. This means it is effectively storing the count of every character present in the source map.
- SourceMap.start_pos may be an AtomicUsize, but its value gets stored in BytePos as a u32 via the Pos trait implementation for it.
- The end_pos calculated in SourceFile::new adds the file's length to start_pos and stores it as a BytePos, i.e. a u32 (ref: Github).
- In the Pos trait implementation for BytePos, as u32 was being used to convert directly from usize.
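The sharing described in these findings can be modelled in a few lines. This is a heavily simplified sketch, not swc's actual types: ToySourceMap, ToyCompiler and new_source_file are stand-ins I made up, and the real SourceMap tracks far more than a counter. It also assumes a 64-bit target, where usize is wider than u32.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Simplified stand-in for swc's SourceMap: it only tracks where the next
/// file should start.
#[derive(Default)]
struct ToySourceMap {
    start_pos: AtomicUsize,
}

impl ToySourceMap {
    /// Reserves `len` bytes and returns the file's (start, end) as u32,
    /// mimicking the lossy `as u32` conversion from the findings.
    fn new_source_file(&self, len: usize) -> (u32, u32) {
        let start = self.start_pos.fetch_add(len, Ordering::SeqCst);
        (start as u32, (start + len) as u32)
    }
}

/// Simplified stand-in for the Compiler: cloning it clones the Arc,
/// not the source map behind it.
#[derive(Clone, Default)]
struct ToyCompiler {
    cm: Arc<ToySourceMap>,
}

fn main() {
    let global = ToyCompiler::default();
    let for_transform = global.clone();
    let for_minify = global.clone();

    // Both clones advance the same counter, so the second file starts
    // where the first one ended, even though a different clone created it.
    let a = for_transform.cm.new_source_file(10);
    let b = for_minify.cm.new_source_file(20);
    println!("{:?} {:?}", a, b);
}
```

Because every clone feeds the same counter, the offset only ever grows for the lifetime of the process, no matter which routine is adding files.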
From these findings we can deduce that, because the Pos trait implementation for BytePos uses as u32 to convert the type directly, the integer overflow goes unnoticed at the point where it happens. Only later, when swc_common::StringInput asserts on the start and end positions, does the failure surface, as expected.
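The difference between a silent and a loud conversion is easy to see in isolation. The sketch below is not swc's code and not a fix that was proposed; to_pos_lossy and to_pos_checked are names I made up, and u32::try_from is the standard library's checked conversion.

```rust
// Needed before the 2021 edition; redundant (but harmless) afterwards.
use std::convert::TryFrom;

/// Lossy conversion, like an `as u32` cast: the high bits are dropped.
fn to_pos_lossy(n: u64) -> u32 {
    n as u32
}

/// Checked conversion: returns None when `n` does not fit in a u32.
fn to_pos_checked(n: u64) -> Option<u32> {
    u32::try_from(n).ok()
}

fn main() {
    let too_big = u32::MAX as u64 + 5;
    // The cast quietly produces a small, plausible-looking position.
    println!("lossy:   {}", to_pos_lossy(too_big));
    // The checked version reports the problem where it happens.
    println!("checked: {:?}", to_pos_checked(too_big));
}
```

With the lossy cast nothing fails at the conversion site, which is why the error only showed up later at an unrelated-looking assertion.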
Possible solutions
From the investigation we now knew the whole story of why the next build was breaking on our huge codebase (or, for that matter, on any codebase whose total number of characters is on the order of 2^32 or above). The next step was to fix the issue and unblock the builds. I could think of only two ways to fix this in swc:
- Increase the size of BytePos to u64.
- Not share the SourceMap instance across both the transform and minify routines.
The first option was likely a major change for SWC, because part of BytePos's total range is reserved for handling comment positions. Changing the type would therefore mean changing all the code related to that reserved space. This may not be as big a task as I think; @kdy1dev might know better.
With the first option out of the question, that left the second. There were two ways to accomplish it: either remove the Arc from swc's Compiler, or stop sharing a common global static instance of Compiler between all routines in next-swc. But with my limited context of swc's codebase, both solutions had some bad tradeoffs. Removing the Arc would not only change SWC's API, it could also cause performance issues. Not sharing the Compiler instance between the routines might impact the performance of next builds, and it would also lose the source map information for code processed between routines.
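The tradeoff between sharing and not sharing can be sketched with a toy model. Again, this is not swc or next-swc code: ToySourceMap, run_shared and run_isolated are hypothetical, and each "routine" here just registers one file and reports where it starts.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Toy source map that only remembers where the next file starts.
#[derive(Default)]
struct ToySourceMap {
    next_start: AtomicUsize,
}

impl ToySourceMap {
    /// Registers a file of `len` bytes and returns its start offset.
    fn add_file(&self, len: usize) -> usize {
        self.next_start.fetch_add(len, Ordering::SeqCst)
    }
}

/// Every routine reuses one shared map, as with a global static instance:
/// offsets accumulate across routines.
fn run_shared(file_lens: &[usize]) -> Vec<usize> {
    let shared = Arc::new(ToySourceMap::default());
    file_lens
        .iter()
        .map(|&len| Arc::clone(&shared).add_file(len))
        .collect()
}

/// Every routine gets its own map: offsets restart at 0 each time, but no
/// map knows about files handled by another routine.
fn run_isolated(file_lens: &[usize]) -> Vec<usize> {
    file_lens
        .iter()
        .map(|&len| ToySourceMap::default().add_file(len))
        .collect()
}

fn main() {
    let lens = [100, 200, 300];
    println!("shared:   {:?}", run_shared(&lens));
    println!("isolated: {:?}", run_isolated(&lens));
}
```

The isolated variant can never accumulate its way past 2^32, at the price of the cross-routine source map information mentioned above.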
So, instead of raising a fix myself, I decided to raise an issue and get suggestions and insights from the maintainers before making a PR. By this time I had shared my findings internally with my team. Then, on a call to create a minimal reproduction, Chinmay, Maulik and I wrote a script that generates a codebase on the order of 2^32 characters, so that maintainers could reproduce the issue easily; find the repository for it here. After this I raised the issue, with my findings and the reproduction, here: https://github.com/vercel/next.js/issues/65436 and https://github.com/swc-project/swc/issues/8932.
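The actual script lives in the repository mentioned above. Purely to illustrate the idea, here is a hypothetical Rust sketch of what such a generator needs to work out: how many files of a given size push the total past what a u32 can hold, and what a filler module could look like. generate_module, files_needed and the exported names are all made up for this example.

```rust
/// Builds the text of one filler JS module that is at least `target_bytes`
/// long. Hypothetical helper; the exported names are arbitrary.
fn generate_module(index: usize, target_bytes: usize) -> String {
    let mut src = String::new();
    let mut n = 0;
    while src.len() < target_bytes {
        src.push_str(&format!("export const value_{}_{} = {};\n", index, n, n));
        n += 1;
    }
    src
}

/// Number of files of `file_bytes` each needed for the total size to
/// exceed u32::MAX bytes.
fn files_needed(file_bytes: u64) -> u64 {
    u32::MAX as u64 / file_bytes + 1
}

fn main() {
    // With 1 MiB files, this many are needed to cross the u32 limit.
    println!("files needed: {}", files_needed(1 << 20));
    // A tiny module, just to show the shape of the generated code.
    print!("{}", generate_module(0, 64));
}
```

A real reproduction also has to make sure the generated modules are actually imported, so that they end up as input to the build.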
Once the issues were raised, @kdy1dev quickly fixed the problem here in the PR (yeah, my chance to contribute to the next repo for this ended here 😭, no complaints though). He chose to fix it by removing the single static instance of the SWC Compiler.
But what about the builds?
In the meantime, to temporarily unblock the builds, we removed the graphql-tag-swc-plugin from our next configuration so that less code was generated for our bundle as input to the minify routine, which avoided the integer overflow.
Even though the fix had landed in next's repo, it had not been released yet. To keep the builds working until the change was released and we had migrated our codebase to that version, we had to find a way to patch the issue on our end. To do this, we built next-swc's napi bindings with the fix locally via docker for the linux and macos platforms, and patched the next package with yarn patch to use our bindings instead of its own. (This was not as straightforward as it may seem here; maybe a story for another blog post.)
Conclusions
Finally, the issue was fixed and we could go back to normal. Special thanks to Chinmay and Maulik from my team for letting me work on this issue. I got to learn a lot from this exercise, and it felt really good to solve it in the end 😎.
Feel free to share your thoughts, point out mistakes, or tell me how this could have been done better in my DMs. Till the next blog post, see you later!