Loading large amounts of data performantly using Node.js Streams
Sometimes you have to load tons of data from a database, a file, etc. So much so that your Node service crashes due to the dreaded "Out of Memory" error. A common use case is generating large CSV exports - think financial transactions from a site like Mint or RocketMoney, or high-frequency sensor data. Users might need these exports to import into their own data pipeline or transfer to another tool.
Note: Usually that export will happen in a task queue, and then will be emailed to the user when the task is complete.
If your implementation is naive and the data you are trying to load is large enough, a single user might be able to crash your server.
Now imagine several users trying to export data at the same time, and things can get out of control quickly.
Instead, the data should load efficiently without consuming all the available memory (and without you having to keep bumping up the amount of memory on your server instance). That way you won't have to worry about things unexpectedly crashing and painfully troubleshooting the cause.
In order to achieve this, a good fix is to use "Streams", which are first-class citizens in Node and commonly supported by popular libraries.
The problem without Streams
If you are running into memory issues, your current implementation probably looks something like this:
const fs = require('fs');
const { Client } = require('pg');
const { Parser } = require('json2csv');

async function exportData() {
  const client = new Client();
  await client.connect();

  const result = await client.query('SELECT * FROM transactions'); // Loads everything into memory

  const json2csvParser = new Parser();
  const csv = json2csvParser.parse(result.rows);

  fs.writeFileSync('output.csv', csv); // Writes entire data set at once

  await client.end();
}

exportData();
This approach works fine for small datasets but fails when dealing with much larger datasets containing potentially millions of rows. What happens in this code is:
- the entire dataset is loaded from the database into memory
- when the data has finished loading in memory, it then gets written to a CSV file
At this point, if the dataset is too big to fit in memory - or several (or even many) users perform the same operation on the same server instance - your service will likely run out of RAM and crash.
And your user will never receive the data they requested.
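If you want to see this for yourself, you can log process.memoryUsage() on an interval while the naive export runs and watch heapUsed climb roughly in step with the size of the result set. A rough sketch, replacing the plain exportData() call from the example above:

// Log heap usage every second while the export runs
const memoryLogger = setInterval(() => {
  const { heapUsed, rss } = process.memoryUsage();
  console.log(
    `heapUsed: ${(heapUsed / 1024 / 1024).toFixed(1)} MB, rss: ${(rss / 1024 / 1024).toFixed(1)} MB`
  );
}, 1000);

exportData()
  .catch(console.error)
  .finally(() => clearInterval(memoryLogger));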
The Stream-based approach
This is where Node Streams come in. Instead of reading all the data into memory at once like in the example above, Streams read the data chunk by chunk and let you process each chunk without keeping the whole dataset in memory.
You have almost definitely heard of "streaming" in another context: Netflix, YouTube, etc. Conceptually it is essentially the same. Instead of having to wait for a whole video or episode to load into memory on your browser or phone before you can begin watching, it loads a chunk at a time so you can begin watching right away, and it doesn't eat up all the memory on your device.
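To make the idea concrete before the database example, here is a minimal sketch using a plain file Read Stream (input.txt is just a made-up path): each 'data' event hands you one chunk, and the whole file never has to sit in memory at once.

const fs = require('fs');

// Read a (potentially huge) file one chunk at a time
const readStream = fs.createReadStream('input.txt');

let bytesSeen = 0;

readStream.on('data', (chunk) => {
  bytesSeen += chunk.length; // process the chunk, then let it be garbage collected
});

readStream.on('end', () => {
  console.log(`Finished reading ${bytesSeen} bytes without loading the whole file`);
});

readStream.on('error', (err) => {
  console.error('Read error:', err);
});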
Using this concept, here's how we can stream data from a PostgreSQL database table and write it directly to a CSV file:
const fs = require('fs');
const { Client } = require('pg');
const QueryStream = require('pg-query-stream');
const { stringify } = require('csv-stringify');

const client = new Client();

async function exportData() {
  await client.connect();

  // Create a Read Stream
  const query = new QueryStream('SELECT * FROM transactions');
  const queryStream = client.query(query);

  // Create a CSV stringify stream
  const csvStream = stringify({ header: true });

  // Create a writable file stream
  const fileStream = fs.createWriteStream('output.csv');

  // Pipe database stream → CSV stream → file
  queryStream.pipe(csvStream).pipe(fileStream);

  fileStream.on('finish', async () => {
    await client.end();
    console.log('Export completed.');
  });

  fileStream.on('error', async (err) => {
    console.error('Error writing file:', err);
    await client.end();
  });
}

exportData().catch(console.error);
The above uses the pg module as the PostgreSQL database client. Often in Node you will be using Knex on top of pg, but for simplicity's sake this example uses pg directly, together with pg-query-stream to handle the "Read Stream".
What is happening is:
- First we create the "Read Stream" and pass it the SQL query
- Then we create a CSV "Transform Stream" (more on what this is in a minute)
- Then we create a file "Write Stream"
- queryStream.pipe(csvStream).pipe(fileStream) - this is where the magic of streaming happens:
  - As the queryStream reads chunks of data, those chunks are processed by the "downstream" streams, csvStream and fileStream, as they arrive, rather than waiting for all of the data from the database to be fetched.
  - The csvStream (a Transform Stream) both reads and writes:
    - Reads: takes raw database rows as input
    - Writes: converts them into properly formatted CSV output
    - This is why it is called a Transform Stream
  - Lastly, the fileStream (Write Stream) takes the output of the csvStream and writes it to the file.
And as data "ends up" in the .csv file, that data is "discarded" from the Read Stream and the Transform Stream, and this is the "magic" behind making Streams a memory-efficient approach for working with large amounts of data.
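One thing to be aware of with .pipe() is that it does not forward errors between streams, so an error on queryStream or csvStream would never reach the fileStream handlers above. On reasonably recent Node versions you can use stream.pipeline (here the promise-based version from stream/promises), which wires up error propagation and cleanup across the whole chain. A sketch of the same export written that way:

const fs = require('fs');
const { pipeline } = require('stream/promises');
const { Client } = require('pg');
const QueryStream = require('pg-query-stream');
const { stringify } = require('csv-stringify');

async function exportData() {
  const client = new Client();
  await client.connect();

  try {
    // pipeline() pipes the streams together and rejects if any one of them errors
    await pipeline(
      client.query(new QueryStream('SELECT * FROM transactions')), // Read Stream
      stringify({ header: true }), // Transform Stream
      fs.createWriteStream('output.csv') // Write Stream
    );
    console.log('Export completed.');
  } finally {
    await client.end();
  }
}

exportData().catch(console.error);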
Note: we know the queryStream (Read) and csvStream (Transform, read/write) discard their data once it has been read and transformed, and we know the fileStream is a Write Stream. A Write Stream still allocates some memory for its operations: it buffers chunks internally (up to its highWaterMark) and frees them as they are flushed to disk, so it too only ever holds a small window of the data at a time. Ultimately this is a technical detail and not crucial for understanding how Streams work as a whole.
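You can actually observe that buffering through the Write Stream API itself: write() returns false once the internal buffer fills up to the highWaterMark, and the 'drain' event fires when it has been flushed. A small standalone sketch (output.txt and the row loop are made up purely for illustration):

const fs = require('fs');

const fileStream = fs.createWriteStream('output.txt');

let i = 0;

function writeSomeRows() {
  while (i < 1_000_000) {
    const line = `row ${i++}\n`;
    // write() returns false when the internal buffer is full...
    if (!fileStream.write(line)) {
      // ...so we pause and wait for 'drain' before writing more (backpressure)
      fileStream.once('drain', writeSomeRows);
      return;
    }
  }
  fileStream.end(() => console.log('Done writing.'));
}

writeSomeRows();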
What if you don't transform the data?
You might be wondering what would happen if, instead of transforming the data to CSV (the csvStream above), you just skipped that step and wrote to the fileStream directly, like this:
queryStream.pipe(fileStream);
In practice this particular pipe will error out rather than produce a messy file: queryStream emits JavaScript row objects (it is an object-mode stream), while a file Write Stream only accepts strings or Buffers, so Node will reject the first chunk it receives. The reading still happens in a streaming way, but you need something in between to serialize each row into text before it can be written to the file.
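If you did want to skip the CSV library, the minimal middle step is a small Transform Stream of your own that serializes each row, for example to one JSON string per line. A sketch, assuming the queryStream and fileStream from the earlier example are in scope:

const { Transform } = require('stream');

// Converts each database row (a JS object) into a line of text the file stream can accept
const rowToLine = new Transform({
  writableObjectMode: true, // accepts objects from queryStream
  readableObjectMode: false, // emits plain strings for fileStream
  transform(row, _encoding, callback) {
    callback(null, JSON.stringify(row) + '\n');
  },
});

queryStream.pipe(rowToLine).pipe(fileStream);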
Now, what if we also removed the .pipe(fileStream) step too, so instead of writing to the fileStream as you are reading from the queryStream (database), you waited for all the data in the queryStream to be fetched first?
This is possible (you could, for example, collect every row into an array as it arrives), but it would defeat the purpose of Streams - which is to "pipe" the data somewhere (to transform it, or to have it end up in a "target" somewhere, in this case a file on the filesystem) at the same time as you are reading it, and to do so in a memory-efficient way. If you are just loading a large amount of data that you would then hold in memory for longer periods of time, then streaming is not the use case for that.
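For completeness, that "wait for everything first" version would look something like this (again assuming the queryStream from the earlier example), and it puts you right back where you started, holding every row in memory:

const rows = [];

queryStream.on('data', (row) => {
  rows.push(row); // every row accumulates in memory - exactly what we were trying to avoid
});

queryStream.on('end', () => {
  console.log(`Loaded ${rows.length} rows into memory`);
  // only now could we stringify everything and write it out in one go
});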
Streaming directly to the Front-end/Browser
Above we explained streaming to a file, but what if we wanted our server to stream to a browser? The use case would be, for example, that instead of downloading the data as a CSV file, the user needs it displayed directly on the front-end.
First, I would be very careful to ensure:
- The user absolutely needs all that data on the front-end, as that much data is typically overwhelming, both from a browser memory perspective and cognitively for the user to act on
- They can't use pagination
- Writing the data to a file won't otherwise suffice
Push back on the requirements if you have to; just make sure this is something you absolutely can't avoid, because it brings its own inefficiencies (like burdening the browser with high memory usage, as I just called out).
But if this absolutely must be done, here is how you can do it. From a learning perspective, it's helpful to understand this as it shows another "target" for Read Streams.
// Server code
const express = require('express');
const { Client } = require('pg');
const QueryStream = require('pg-query-stream');

const app = express();
const port = 3000;

app.get('/stream', async (req, res) => {
  const client = new Client();
  await client.connect();

  const query = new QueryStream('SELECT * FROM transactions');
  const queryStream = client.query(query); // Readable Stream

  // Set headers for NDJSON streaming
  res.setHeader('Content-Type', 'application/x-ndjson');

  queryStream.on('data', (row) => {
    res.write(JSON.stringify(row) + '\n'); // Send each row as a separate JSON object with newline
  });

  queryStream.on('end', async () => {
    res.end();
    await client.end();
  });

  queryStream.on('error', async (err) => {
    console.error('Streaming error:', err);
    // Once rows have been written the headers are already sent, so we can only end the response
    if (!res.headersSent) {
      res.status(500).send('Error streaming data');
    } else {
      res.end();
    }
    await client.end();
  });

  req.on('close', async () => {
    queryStream.destroy(); // Stop reading from the database if the client disconnects
    await client.end(); // Ensure DB connection is closed
  });
});

app.listen(port, () => {
  console.log(`Server running on http://localhost:${port}`);
});
// Front-end code
const fetchNDJSON = async () => {
  const response = await fetch('/stream');

  if (!response.body) {
    console.error('Streaming not supported');
    return;
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let partialData = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    partialData += decoder.decode(value, { stream: true });
    let lines = partialData.split('\n');
    partialData = lines.pop(); // Keep last partial line for next chunk

    for (let line of lines) {
      if (line.trim()) {
        console.log('Received row:', JSON.parse(line)); // Process each JSON row
      }
    }
  }

  console.log('Stream completed.');
};

fetchNDJSON();
The key in the above code is to use the NDJSON spec. By setting the response header to application/x-ndjson, we tell the browser that this is a JSON response in newline-delimited format, which means the browser can parse the JSON as it is received, rather than all at once.
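Concretely, the response body looks something like this on the wire, one JSON object per line (the rows and columns here are made up just to show the shape):

{"id":1,"amount":42.5,"description":"Coffee shop","created_at":"2024-01-02T09:15:00Z"}
{"id":2,"amount":-1200,"description":"Rent","created_at":"2024-01-03T08:00:00Z"}
{"id":3,"amount":89.99,"description":"Groceries","created_at":"2024-01-04T18:30:00Z"}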
So, as the queryStream reads data from the database, that data gets returned by the Express REST API, and the browser reads and parses it as it comes in.
In this case, all the data will eventually accumulate in the browser unless your front-end code discards or filters it, so this is something to be aware of: you could still end up with memory issues, just in the browser instead of on the server.
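As a variant on the server code above, the row-to-NDJSON step can also be written as a Transform Stream and piped into res (the HTTP response is itself a Write Stream), which gets you backpressure handling for free if the browser consumes data slowly. A sketch of just the route handler, assuming the same app, Client and QueryStream setup as the server example:

const { Transform } = require('stream');
const { pipeline } = require('stream/promises');

app.get('/stream', async (req, res) => {
  const client = new Client();
  await client.connect();

  res.setHeader('Content-Type', 'application/x-ndjson');

  // Turn each row object into one NDJSON line
  const toNDJSON = new Transform({
    writableObjectMode: true,
    transform(row, _encoding, callback) {
      callback(null, JSON.stringify(row) + '\n');
    },
  });

  try {
    await pipeline(
      client.query(new QueryStream('SELECT * FROM transactions')),
      toNDJSON,
      res // the HTTP response is itself a Write Stream
    );
  } catch (err) {
    // pipeline() also rejects here if the browser disconnects mid-stream
    console.error('Streaming error:', err);
  } finally {
    await client.end();
  }
});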
When to use streaming vs. pagination
Pagination is an entirely different concept from streaming, but it is still a very viable alternative to fetching a large amount of data in one go (whether you stream it or load it all at once).
Pagination means you load consecutive chunks of data per request. Take Google search results, for example - first your browser requests X number of search results, which is one HTTP request to Google. Then, if you click one of the page numbers at the bottom of the search results page, it makes another HTTP request to load the next set of data. You are neither loading all the data at once, nor streaming it.
While streaming is great for large exports, it's not always the best choice. The general guidelines I use to choose one over the other are:
- Use streaming when you need to process large amounts of data in a single request, such as exporting huge datasets.
- Use pagination when dealing with interactive user queries where a user is browsing through data (for example, looking at the latest week of sensor data, then paginating the next week, then the next, etc). This keeps response times fast and reduces memory usage on both the server and the client.
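For contrast with the streaming export, a paginated endpoint might look something like the sketch below - a hypothetical /transactions route using keyset pagination (WHERE id > cursor plus LIMIT), reusing the Express app and pg Client from the server example. It assumes the table has an auto-incrementing id column:

app.get('/transactions', async (req, res) => {
  const limit = Math.min(parseInt(req.query.limit, 10) || 50, 200); // cap the page size
  const cursor = parseInt(req.query.cursor, 10) || 0; // id of the last row the client has seen

  const client = new Client();
  await client.connect();

  try {
    const { rows } = await client.query(
      'SELECT * FROM transactions WHERE id > $1 ORDER BY id LIMIT $2',
      [cursor, limit]
    );

    res.json({
      data: rows,
      // The client passes this back as ?cursor=... to fetch the next page
      nextCursor: rows.length ? rows[rows.length - 1].id : null,
    });
  } finally {
    await client.end();
  }
});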
Should You Even Load All This Data?
The original use case described - a user exporting large amounts of data - is both realistic and a good example for demonstrating how to implement streaming.
That being said, if this request comes to you while working on your app, you should still question if it's absolutely necessary. Before implementing a stream-based solution, ask yourself - does the user really need all this data at once? Some alternatives would be:
- Provide filters to let users narrow down their data before exporting (see the sketch after this list)
- Offer scheduled exports that generate large files asynchronously
- Or, don't even have an export and use pagination like discussed above
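To illustrate the first option, the streaming export from earlier can be parameterized so users only export what they need. This sketch reuses the pipeline approach from above; the created_at column and the startDate/endDate values are assumptions about your schema, not something from the original example:

const fs = require('fs');
const { pipeline } = require('stream/promises');
const { Client } = require('pg');
const QueryStream = require('pg-query-stream');
const { stringify } = require('csv-stringify');

// Only export what the user actually asked for, e.g. a date range
async function exportFilteredData(startDate, endDate) {
  const client = new Client();
  await client.connect();

  try {
    await pipeline(
      client.query(new QueryStream(
        'SELECT * FROM transactions WHERE created_at BETWEEN $1 AND $2', // created_at is an assumed column
        [startDate, endDate]
      )),
      stringify({ header: true }),
      fs.createWriteStream('output.csv')
    );
  } finally {
    await client.end();
  }
}

exportFilteredData('2024-01-01', '2024-02-01').catch(console.error);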
It's always important to first understand why you are doing something and to question if it's really necessary before you implement the technical solution.
Wrapping up
By leveraging Node.js streams, you can efficiently load and process large datasets without running out of memory. While streaming is powerful, it's important to consider whether an alternative approach (like pagination or filtering) might be more appropriate for your use case.
Knowing how streaming architecture works in Node is one hurdle... understanding how to structure your project is another. Want an Express REST API structure template that makes it clear where your logic should go, and configures basic CI for you? Sign up below to receive that template, plus a post explaining how that structure works and why it's set up that way, so you don't have to waste time wondering where your code should go. You'll also receive all my new posts directly to your inbox!