When you’re working in a monolithic legacy codebase spanning hundreds of thousands of lines, finding that one notorious function—or even a single line of code—can feel like searching for a needle in a haystack.

Last year, my team was tasked with a critical mission: stabilize SignDesk to ensure higher uptime and reliability.

But the problem statement was vague. Where do you begin in such a vast and aging system?

Observability First

We quickly realized that the first step was observability: without visibility into the running system, you are debugging in the dark. So we started with Node.js CPU profiling.

We profiled the production server during peak transaction hours—1 hour per session—to identify the functions and lines of code consuming the most CPU time.
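The workflow, in minimal form, looks like this. The `hotspot.js` script below is a stand-in for the real server entry point; `--cpu-prof` is a real Node.js flag (v12+) that writes a V8 CPU profile on process exit:

```shell
# Create a small CPU-bound script to stand in for the real server
cat > hotspot.js <<'EOF'
let total = 0;
for (let i = 0; i < 1e7; i++) total += i % 7;
console.log(total);
EOF

# --cpu-prof writes a .cpuprofile file when the process exits; open it
# in Chrome DevTools (Performance tab) to see which functions and lines
# dominate CPU time
mkdir -p profiles
node --cpu-prof --cpu-prof-dir=./profiles hotspot.js
ls profiles
```

For a long-running server, the same flags apply; you stop the process after the profiling window (an hour, in our case) and inspect the generated profile.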

What we found was eye-opening.

Counterintuitive Insights

Contrary to intuition, it wasn’t the large, complex functions that were choking the system. Instead, seemingly harmless, simple functions were spending 30–40 seconds on the CPU.

A common culprit? Nested for loops and .find() operations on large arrays.

At scale, these innocent-looking iterations were crippling server performance.
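The pattern looked roughly like this (the names are illustrative, not SignDesk's actual code):

```javascript
// Anti-pattern: for every transaction, scan the entire users array.
// Array.prototype.find is a linear search, so with n transactions and
// m users this is O(n * m) overall.
function attachUsers(transactions, users) {
  return transactions.map((txn) => ({
    ...txn,
    user: users.find((u) => u.id === txn.userId), // O(m) scan per transaction
  }));
}
```

With a few thousand entries on each side, a single call like this can burn millions of comparisons.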

The Fix: Smarter Data Structures

The solution was straightforward yet powerful: better data structures.

We moved from arrays to hash maps. Searching an array with .find() is O(n) per lookup, while a hash map lookup is O(1) on average: a game-changer at production scale.

Nested loops that previously hit O(n²) complexity now ran in roughly linear time. With millions of transactions, this change alone brought CPU spikes down from 70–80% to a healthy baseline.

The same function that once hogged the CPU for 30+ seconds now executed in just 2 seconds.
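A minimal sketch of the shape of the fix (again with illustrative names, not the production code):

```javascript
// Build the index once (O(m)); every subsequent lookup is O(1) on average.
function attachUsersFast(transactions, users) {
  const usersById = new Map(users.map((u) => [u.id, u]));
  return transactions.map((txn) => ({
    ...txn,
    user: usersById.get(txn.userId), // constant-time lookup
  }));
}
```

The total cost drops from O(n * m) to O(n + m), and the behavior is identical for unique IDs.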

Unblocking the Event Loop

We also discovered several synchronous file I/O operations that were blocking the event loop. These were converted into asynchronous calls, keeping the JavaScript event loop free and responsive.

Separating Concerns

Another performance issue was heavy cron schedulers running during business hours—right alongside critical transaction flows.

We extracted these into a dedicated microservice, allowing the monolith to focus solely on transactions, vastly improving throughput and system reliability.
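A sketch of the shape of the extracted service; the batch job and its names are hypothetical, and a bare interval stands in for a cron library:

```javascript
// scheduler.js: runs as its own process, so batch CPU work no longer
// competes with the monolith's transaction handling.

// Hypothetical batch job that previously ran inside the monolith
function findExpiredDocuments(documents, now = Date.now()) {
  return documents.filter((doc) => doc.expiresAt <= now).map((doc) => doc.id);
}

// In the dedicated service, a scheduler drives the job, e.g.:
//   setInterval(() => markExpired(findExpiredDocuments(docs)), 60 * 60 * 1000);
// (a cron library such as node-cron would replace the bare interval)
```

The key design choice is process isolation: even if a batch run pegs a CPU core, the transaction path is unaffected.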

Resilience Through Redundancy

In systems dependent on external vendors, uptime is often out of your hands. If a vendor is down, your product goes down with it.

To address this, we integrated multiple vendors (NSDL & CVL) for the same service and built an intelligent auto-switch mechanism. This system detects vendor downtime in real time and switches to a secondary provider seamlessly—often before users even notice.
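A minimal sketch of the failover idea, assuming each vendor client exposes the same verify-style method (the interface is hypothetical, not the actual integration):

```javascript
// Try vendors in priority order; on failure, fall through to the next.
// A production version would also track vendor health and alert on switches.
async function callWithFailover(request, vendors) {
  let lastError;
  for (const vendor of vendors) {
    try {
      return await vendor.verify(request); // first healthy vendor wins
    } catch (err) {
      lastError = err; // vendor down or erroring: try the next one
    }
  }
  throw lastError; // every vendor failed
}
```

Because the switch happens per request, a vendor outage degrades into a slightly slower call rather than a failed one.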

The Results

All these initiatives—profiling, restructuring, async ops, microservices, and intelligent failovers—culminated in a massive win:

- 50% reduction in server resource allocation
- Zero downtime during peak transaction months
- Significantly fewer customer complaints
- Greater confidence across teams