Parsing CSV files in a Cloudflare Worker
I want to share the solution to a problem that stumped me for a day: how to parse CSV files in Cloudflare Workers. In particular, I want to:
- Accept a CSV file from a form-encoded HTTP request
- Parse the CSV file
- Do something with it
I chose csv-parse as my CSV parsing library. Like many CSV parsers designed to run server-side, it uses the Node.js Stream API under the hood, which is not supported in the Cloudflare Workers environment without compatibility layers I did not want to introduce. Instead, I needed the build of the library intended for browsers, which works with the WHATWG Streams API that Workers implement.
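At the import level, the difference is just which entry point you pull in; the paths below are the ones the csv-parse package documents for its Node.js and browser distributions:

```js
// Default entry point: built on the Node.js Stream API, which a
// plain Worker does not provide.
// import { parse } from 'csv-parse';

// Browser build: works with WHATWG streams instead.
import { parse } from 'csv-parse/browser/esm';
```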
What is going on here?
When the Worker receives a request, we do the following (a code sketch follows the list):
- Look for a field called `csvData` in the form data of the request. Assume this field contains a file.
- Set up a `ReadableStream` to consume the file data.
- Use an async iterator (`for await`) to consume each chunk of file data streamed into the Worker from the HTTP request.
- Decode the bytes of file data back into text using a `TextDecoder`. This assumes the file is UTF-8 encoded, which is a good guess.
- Feed each chunk of text decoded by the `TextDecoder` back into the CSV parser.
- If the CSV parser is busy, block until it emits a `drain` event by calling `parser.once('drain', ...)`. This keeps backpressure on the uploaded form data, preventing transfer until we’ve caught up processing the CSV data.
- For each row of CSV data produced by the CSV parser, fire a callback function to do something with it. In this simple example, I am just collecting the rows and returning them to the client.
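Putting those steps together, here is a minimal sketch of what the Worker can look like. The `csvData` field name comes from the list above; the module-style `fetch` handler, the lack of error handling for a missing field, and returning the rows as JSON are illustrative assumptions rather than the definitive implementation:

```js
import { parse } from 'csv-parse/browser/esm';

export default {
  async fetch(request) {
    const formData = await request.formData();
    const file = formData.get('csvData'); // assumed to be an uploaded File

    const rows = [];
    const parser = parse({});

    // Collect each row the parser emits. A real application would
    // do its per-row work here instead of buffering everything.
    parser.on('readable', () => {
      let record;
      while ((record = parser.read()) !== null) {
        rows.push(record);
      }
    });

    const finished = new Promise((resolve, reject) => {
      parser.on('end', resolve);
      parser.on('error', reject);
    });

    // file.stream() is a WHATWG ReadableStream; Workers let us
    // consume it with an async iterator.
    const decoder = new TextDecoder('utf-8');
    for await (const chunk of file.stream()) {
      const text = decoder.decode(chunk, { stream: true });
      // write() returns false while the parser is busy; waiting for
      // 'drain' applies backpressure all the way back to the client.
      if (!parser.write(text)) {
        await new Promise((resolve) => parser.once('drain', resolve));
      }
    }
    parser.end();
    await finished;

    return Response.json({ rows });
  },
};
```

The only place data accumulates here is the `rows` array; swap that collection step for real per-row processing and the Worker's memory use stays proportional to a single chunk.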
Why did you make this so complicated?
Ah OK! So this is the interesting part. Instead of all these streams and callbacks, I could have just done something like this (using csv-parse's synchronous API):

```js
import { parse } from 'csv-parse/browser/esm/sync';

const formData = await request.formData();
const uploadedText = await formData.get('csvData').text();
const parsedCsvData = parse(uploadedText);
// Do something with parsedCsvData
```
The problem here is that `await formData.get('csvData').text()` waits for the entire CSV file to be uploaded to the Worker and keeps it in memory while your CSV is processed. This introduces two problems:
- CSV processing cannot start until the entire file is uploaded to the Worker, slowing down response time.
- Workers are limited to a total of 128 MB of memory per run; this is a lot, but it does mean you have a hard upper limit on the size of CSV file you can handle. Keep in mind that 128 MB is shared with all the libraries, request data, and parsed data you are dealing with.
The Streams API is a powerful system for coordinating the just-in-time movement of data through an application, keeping your memory footprint as small as possible. Streams even allow upstream flow control of incoming data: even the small sample above can communicate from the CSV parser all the way back to the client who is uploading the CSV file, telling her system to stop sending data until it's ready for more. Remarkable!
Streams are a deep, interesting technology. I recommend starting with MDN's excellent introduction to learn more.