Parsing CSV files in a Cloudflare Worker
I want to share the solution to a problem that stumped me for a day: how to parse CSV files in Cloudflare Workers. In particular, I want to:
- Accept a CSV file from a form-encoded HTTP request
- Parse the CSV file
- Do something with it
I chose to use csv-parse as a CSV parsing library. Like many CSV parsers designed to work server-side, it uses the Node.js Stream API under the hood, which is not supported in the Cloudflare Workers environment without some compatibility layers I did not want to introduce. I needed to use the version of the library intended for browsers, which is built on the WHATWG Stream API that Workers implement.
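In practice the switch is just a different import path; csv-parse publishes a separate browser build (path per the csv-parse docs; double-check against the version you have installed):

```js
// Node.js build: depends on the Node.js Stream API, unavailable in Workers
// import { parse } from 'csv-parse';

// Browser build: bundled without Node built-ins, loads fine in Workers
import { parse } from 'csv-parse/browser/esm';
```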
What is going on here
When the Worker receives a request, we do the following:
- Look for a field called `csvData` in the form data request. Assume this field contains a file.
- Set up a `ReadableStream` to consume the file data.
- Use an async iterator (`for await`) to consume each chunk of file data streamed into the Worker from the HTTP request.
- Decode the bytes of file data back into text using a `TextDecoder`. This assumes the file is UTF-8 encoded, which is a good guess.
- Feed each chunk of text decoded by the `TextDecoder` back into the CSV parser.
- If the CSV parser is busy, block until it emits a `drain` event by calling `parser.once('drain', ...)`. This keeps backpressure on the uploaded form data, pausing the transfer until we’ve caught up processing the CSV data.
- For each row of CSV data produced by the CSV parser, fire a callback function to do something with it. In this simple example, I am just collecting the rows and returning them to the client, as in the sketch below.
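Here is a minimal sketch of how those steps fit together, assuming the browser build of csv-parse (`csv-parse/browser/esm`), a module-syntax Worker, and the `csvData` field name from above; parser options and error handling are elided.

```js
import { parse } from 'csv-parse/browser/esm';

export default {
  async fetch(request) {
    const formData = await request.formData();
    const file = formData.get('csvData'); // assume this is a File

    const parser = parse({});
    const rows = [];
    parser.on('readable', () => {
      let row;
      while ((row = parser.read()) !== null) {
        rows.push(row); // "do something" with each row
      }
    });

    const decoder = new TextDecoder('utf-8');
    // file.stream() is a WHATWG ReadableStream, which Workers can
    // consume with for await
    for await (const chunk of file.stream()) {
      // write() returns false when the parser's buffer is full;
      // wait for 'drain' before pulling more data off the wire
      if (!parser.write(decoder.decode(chunk, { stream: true }))) {
        await new Promise((resolve) => parser.once('drain', resolve));
      }
    }
    parser.end();
    await new Promise((resolve) => parser.once('end', resolve));

    return Response.json(rows);
  },
};
```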
Why did you make this so complicated?
Ah OK! So this is the interesting part. Instead of all these streams and callbacks, I could have just done something like
```js
import { parse } from 'csv-parse/browser/esm/sync';

const formData = await request.formData();
const uploadedText = await formData.get('csvData').text();
const parsedCsvData = parse(uploadedText);
// Do something with parsedCsvData
```
The problem here is that `await formData.get('csvData').text()` waits for the entire CSV file to be uploaded to the Worker and holds it in memory while the CSV is processed. This introduces two problems:
- CSV processing cannot start until the entire file has been uploaded to the Worker, slowing down response time.
- Workers are limited to a total of 128MB of memory per run; this is a lot, but it does mean you have a hard upper limit on the size of CSV file you can handle. Keep in mind that 128MB is shared with all the libraries, request data, and the parsed data you are dealing with.
The Stream API is a powerful system for coordinating the just-in-time movement of data through an application, keeping your memory footprint as small as possible. Streams even allow upstream flow control of the incoming data: even the small example above can communicate from the CSV parser all the way back to the client who is uploading the CSV file, telling her system to stop uploading data until it’s ready for more. Remarkable!
Streams are a deep, interesting technology. I recommend starting with MDN’s excellent introduction to learn more.
The Apple Maps Collections URL Scheme
Apple has created a URL scheme to share Apple Maps collections without storing any server-side data. All information necessary to display the collection is stored in a protocol buffer in the URL itself. I’ll show you how to decode the data stored in a collection URL, and dig into what it means.
In 2019, Apple Maps introduced collections: sharable lists of locations in Apple Maps. iOS natively supports collections, so sharing between iOS devices provides a seamless experience. But Maps.app for macOS doesn’t yet support collections at all. So sharing a collection from iOS to macOS opens a web page with a URL like this:
https://collections.apple.com/collection?_col=ChkzNiBIb3VycyBpbiBQb3J0bGFuZCwgTWUuEl4aNVdlc3Rlcm4gUHJvbWVuYWRlLCBQb3J0bGFuZCwgTUUgIDA0MTAyLCBVbml0ZWQgU3RhdGVzIhIJvol88gvTRUARJ62u%2BaeRUcAqEVdlc3Rlcm4gUHJvbWVuYWRlEg0Irk0Q%2Bsvey9ivxpsPEg4Irk0QyuH475jurPioARJdGjAxODEgU3RhdGUgU3QsIFBvcnRsYW5kLCBNRSAgMDQxMDEsIFVuaXRlZCBTdGF0ZXMiEgmReYlWk9NFQBHVlwBXDZFRwCoVT25lIExvbmdmZWxsb3cgU3F1YXJlEg0Irk0QmarQ4r6omMpiEg4Irk0Qnojwv%2BmwwYjjARINCK5NEMX3iPvSiK7%2BEhIOCK5NEPiehseY6IrDgAESDQiuTRDhnL69jcilpF8SDgiuTRCUtvKfpI%2FujJMBEg0Irk0Qlu2S5PP6wbAaEg0Irk0Qta7tvbWs16w8Eg4Irk0QyprGyqzzuamOARINCK5NEKK7ke%2FBoJyfIhIOCK5NEPPmsNixr4awzwESDQiuTRC3pN3hgdTCjG4SDgiuTRDZ0KWnlan9o5sBEg4Irk0Qz5Dv76HQub%2FdARIOCK5NEKbF7OSsgomCrgESDQiuTRCXk%2Fmpq4icq3oSDgiuTRCu2LSXt8ze3aQBEg4Irk0Qxr%2FOvs%2Bz%2Bo7rARIOCK5NEI7Ko4r1j86osAESDQiuTRDhqoT2tLCayEMSDgiuTRDR38Pkz%2BWjzdEBEg0Irk0QqoS1hc23pPUzEg4Irk0QybPZmtGxzK2YARINCK5NELudnqabgaWNHA%3D%3D
That is one grotesque URL! It is so long because Apple Maps does not require any server-side data to display a shared collection; the URL itself contains a protocol buffer holding all the data necessary to reconstitute the collection. A protocol buffer is a highly compact data serialization format developed and open-sourced by Google(!). It’s a fantastic tool for squashing a dictionary-like data structure down into a very small number of bytes. The algorithm to extract the collection data from the above URL is as follows:
- URL-decode the `_col` query parameter
- Base64-decode the resulting string
- Parse the resulting binary data as a protocol buffer with the following format:
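The definition below is my reconstruction: the field names are reverse-engineered guesses (see the footnote), and the field numbers are inferred from the wire data, so treat it as a sketch rather than Apple’s actual schema.

```protobuf
syntax = "proto3";

message Coordinates {
  double latitude = 1;
  double longitude = 2;
}

message Location {
  uint64 lsp = 1;          // mystery field; see footnote
  uint64 appleMapsId = 2;  // set when the pin is an existing Apple Maps place
  string address = 3;
  Coordinates coordinates = 4;
  string name = 5;
}

message Collection {
  string name = 1;
  repeated Location location = 2;
}
```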
Below is a simple Python script that will print out the data contained in the `_col` parameter of the Apple Maps collection URL above. It assumes I’ve compiled the above protocol into a Python module named `maps_pb2`:
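```python
# A sketch of the decoding steps, assuming the schema above was compiled
# with `protoc --python_out=. maps.proto` to produce maps_pb2.
import base64
from urllib.parse import parse_qs, urlparse

import maps_pb2

url = "https://collections.apple.com/collection?_col=ChkzNiBIb3Vycy..."  # the full URL from above

# 1. URL-decode the _col query parameter (parse_qs decodes the percent-escapes)
col = parse_qs(urlparse(url).query)["_col"][0]

# 2. Base64-decode the resulting string
raw = base64.b64decode(col)

# 3. Parse the binary data as a protocol buffer
collection = maps_pb2.Collection()
collection.ParseFromString(raw)

print(collection)
```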
This program will print out the following data structure:
name: "36 Hours in Portland, Me." location { address: "Western Promenade, Portland, ME 04102, United States" coordinates { latitude: 43.6488021 longitude: -70.2758774 } name: "Western Promenade" } ... location { address: "181 State St, Portland, ME 04101, United States" coordinates { latitude: 43.6529339 longitude: -70.2664392 } name: "One Longfellow Square" } location { lsp: 9902 appleMapsId: 7103409456625751321 } ... location { lsp: 9902 appleMapsId: 2025093751865052859 }
The data structure is pretty simple: it contains the collection name, and then a `location` structure for every saved location (i.e. pin on the map) in the collection. However, the location structures take two forms. One has a location name, address, and coordinates:

```
location {
  address: "181 State St, Portland, ME 04101, United States"
  coordinates {
    latitude: 43.6529339
    longitude: -70.2664392
  }
  name: "One Longfellow Square"
}
```
The other just has an Apple Maps ID and some field that I’m calling `lsp`*:

```
location {
  lsp: 9902
  appleMapsId: 2025093751865052859
}
```
When constructing a collection, you can either add existing Apple Maps locations, or you can “drop a pin”, name your custom location, and add that. In the former case, your location will just be stored with an id; in the latter, you’ll have the full record stored in the location structure.
These Apple Maps IDs can be resolved by plugging them into an Apple Maps URL like so: https://maps.apple.com/place?auid=APPLE-MAPS-ID
Apple Maps collection URLs provide another data point in Apple’s continued effort to protect their users’ privacy. There is no personally identifying information in the collections URL scheme. Once shared, it is impossible to determine who originally authored a collection. Very ugly URLs, but a pretty attractive user experience.
I’m also encouraged that the Maps team chose to use protocol buffers, a Google open source project, at all. It’s another great sign that NIH syndrome antibodies are flowing around Cupertino.
* Serialized protocol buffers (like the `_col` parameter in the collection URLs above) do not contain any information about what the fields within them are called. So the protocol definition above is entirely a work of reverse engineering: I have no idea whether the field I’m calling `address` is actually called address inside the Maps application. That said, all fields in the data structure had a pretty obvious meaning except for a mysterious integer, `9902`, that always appeared when a location was described only by an ID. But if you share an individual location from Apple Maps, the generated URL always contains the query parameter `lsp=9902`. So I called the mystery field in the protobuf the same thing: `lsp`. I’m dying to know why this field is here and what it means; if you’re willing to fill in this blank, please do get in touch.