NodeJS cluster wrapper to gracefully manage workers
Regiment abuses the NodeJS cluster module to seamlessly replace workers after certain
criteria are met. The goal is to keep the cluster up without dropping requests.
npm install --save regiment
var Regiment = require('regiment');
var Express = require('express');
var app = Express();
// You can use either or both of the provided criteria middlewares, or contribute your own
app.use(Regiment.middleware.MemoryFootprint(750)); // Replace workers after rss reaches 750mb
app.use(Regiment.middleware.RequestCount(1000)); // Replace workers after every 1000 requests
Regiment(function(workerId) { return app.listen(); }); // default options
Regiment(function(workerId) { return app.listen(); }, options); // with options
{
  numWorkers: 1,  // Number of workers you want -- defaults to number of CPUs
  deadline: 5000, // Milliseconds to wait for worker to gracefully die before forcing death
}
You have a leak in production and want your application to stay up while you figure out what is
going on or wait for a dependency to fix their leak.
You are familiar with --max-old-space-size and other V8 knobs that crash your application when the
threshold is met instead of gracefully responding to outstanding requests.
Workers use middleware to monitor for certain conditions like RSS size or requests served. When the
criteria for replacement is met, a worker signals that it needs to be replaced by sending a message
to the cluster.
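To make the worker side concrete, here is a minimal sketch of a criteria middleware. It is not Regiment's actual source: the function name `memoryFootprintSketch`, the injectable `send` parameter, and the `need_replacement` message shape are all assumptions made for illustration.

```javascript
// Hypothetical sketch, not Regiment's implementation.
// Returns an Express-style middleware that asks the cluster for a replacement
// once this worker's resident set size (RSS) exceeds maxRssMb.
function memoryFootprintSketch(maxRssMb, send) {
  // By default, message the cluster over the IPC channel (only present in forked workers).
  send = send || function (msg) { if (process.send) process.send(msg); };
  var requested = false; // ask for a replacement only once per worker

  return function (req, res, next) {
    var rssMb = process.memoryUsage().rss / (1024 * 1024);
    if (!requested && rssMb > maxRssMb) {
      requested = true;
      // The message shape is an assumption made for this example.
      send({ cmd: 'need_replacement', reason: 'rss at ' + Math.round(rssMb) + 'mb' });
    }
    next(); // never block the request; replacement happens in the background
  };
}
```

The check runs on every request but signals at most once, so a worker that is already being replaced keeps serving traffic without flooding the cluster with messages.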
The cluster receives the message and spins up a new worker. It waits for the new worker to come
up, then signals the old worker to stop accepting new connections and to exit after servicing all
current requests. The old worker is then disconnected from the cluster and receives no new
requests.
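The handshake described above could look roughly like the following. This is a sketch under stated assumptions rather than Regiment's code: `replaceWorker`, `handleShutdown`, the `shutdown` message name, and the choice to wait for the worker's `listening` event are all hypothetical. The `cluster`, `worker`, and `server.close` calls are standard Node APIs.

```javascript
var cluster = require('cluster');

// Hypothetical sketch of the cluster side; not Regiment's implementation.
// Forks a replacement and retires oldWorker only once the replacement is listening.
function replaceWorker(oldWorker, clusterImpl) {
  clusterImpl = clusterImpl || cluster;
  var replacement = clusterImpl.fork();
  replacement.once('listening', function () {
    // 'shutdown' is an assumed message name for this example.
    oldWorker.send({ cmd: 'shutdown' });
    // Disconnecting stops the cluster from routing new connections to the old worker.
    oldWorker.disconnect();
  });
  return replacement;
}

// Hypothetical sketch of the worker side: finish in-flight requests, then exit.
function handleShutdown(server, proc) {
  proc = proc || process;
  proc.on('message', function (msg) {
    if (msg && msg.cmd === 'shutdown') {
      // close() stops accepting new connections; the callback runs once
      // the existing connections have ended.
      server.close(function () { proc.exit(0); });
    }
  });
}
```

In the primary process, `replaceWorker` might be called from a `cluster.on('message', function (worker, msg) { ... })` handler when a worker reports that it needs replacing.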
Note: You can have up to 2x numWorkers when replacements come online but before the old ones
gracefully die. This is temporary and by design as it drops back down to numWorkers.
Note: By default, the number of workers is set to the number of available CPUs. This module works
just as well on small dynos where the number of CPUs is 1. A new worker is spawned and the old one
is replaced. The default for deadline is 15 seconds. Regiment will wait this amount of time
for the worker to exit on its own and then forcefully kill it.
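The deadline behaviour can be illustrated with a small timer. This is only a sketch: `killAfterDeadline` is a hypothetical helper, not part of Regiment's API, and using `SIGKILL` for the forced kill is an assumption. `worker.kill()` and the worker `exit` event are standard Node cluster APIs.

```javascript
// Hypothetical helper, not part of Regiment's API.
// Gives a retiring worker deadlineMs to exit on its own, then forces it.
function killAfterDeadline(worker, deadlineMs) {
  if (deadlineMs === undefined) deadlineMs = 15000; // the documented default of 15 seconds
  var timer = setTimeout(function () {
    // SIGKILL is an assumption; any signal the worker cannot ignore would do.
    worker.kill('SIGKILL');
  }, deadlineMs);
  // If the worker exits in time, cancel the forced kill.
  worker.once('exit', function () { clearTimeout(timer); });
  return timer;
}
```

Clearing the timer on `exit` matters: otherwise a well-behaved worker would leave a pending timer behind for every replacement.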
I was heavily inspired by @hunterloftis’s Throng library and Forky.