How We Deal with Broken ZIP Files: Pick the Lesser of Two Evils

clypd’s workflow involves offline processing of files from our clients and partners. They drop compressed ZIP files onto our SFTP servers, which are then picked up and processed by our workers. The processed data is crucial in powering clypd’s advanced audience targeting platform. Because these files are critical to our business, we have monitors that ensure the files are parsed and transformed correctly.

The Problem:

Occasionally our workers would fail to decompress some of those ZIP files. They would run into an unexpected EOF error similar to the one below:

INFO: parsing "TempFile1.csv" within zip file "ZipFile.ZIP":
panic: process XXX: ParseXXXFiles: Scanner error: unexpected EOF

This error was spooky because every other unzipping tool would unzip the file without any problem. We tried unzipping the file using Unix’s unzip, 7-zip on Windows, and macOS’s built-in ZIP tool, and all of them successfully unzipped the file. We used to parse these ZIP files in Rails before our backend team switched to Go, and the Rails version of the same worker never encountered an error like the one above. All of this pointed to something weird with Go’s zip package.

Quick Tour of ZIP:

Before we get into the details of why Go’s zip package was failing to parse the ZIP files, it is worth going over what a ZIP file is. A ZIP archive can store multiple files. It has a central directory at the end that lists all the files archived in it, and, in addition to the central directory, each archived file starts with a local file header containing metadata about that file.
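Schematically, an archive with two files is laid out like this (simplified):

    [local file header 1][file 1 data]
    [local file header 2][file 2 data]
    [central directory]                 <- lists every archived file
    [end of central directory record]   <- locates the central directory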

[Figure: ZIP-64 internal layout]

The Culprit:

Further investigation of the problematic ZIP files revealed inconsistent local file headers. One of the pieces of metadata the local file header stores is the uncompressed size: how big the file will be once uncompressed. We identified a pattern in all the ZIP files that our worker failed to parse:

  1. The ZIP archives contained at least one file that was bigger than 4GB when uncompressed
  2. The uncompressed size for that big file in the local file header was always smaller than its actual uncompressed size

The above pattern pointed to a cause as old as the ZIP format itself. The "uncompressed size" is a 32-bit field in the local file header. Consequently, the field simply doesn’t have enough bits to store sizes bigger than 4GB! Since each problematic ZIP archive had at least one file bigger than 4GB when uncompressed, an inconsistent uncompressed size was bound to occur for those ZIP files.
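To see how such a wrong value arises, here is a minimal Go sketch of 32-bit truncation (the 5GB figure is hypothetical):

    package main

    import "fmt"

    func main() {
        // A hypothetical file that is 5GB uncompressed.
        var actual uint64 = 5 << 30
        // A 32-bit header field wraps the value modulo 2^32.
        truncated := uint32(actual)
        fmt.Println(actual)            // 5368709120
        fmt.Println(uint64(truncated)) // 1073741824, i.e. 1GB
    }

The truncated value is always smaller than the real one, which matches the second point of the pattern above.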

The ZIP format evolved to address exactly this problem. Later versions of ZIP added a 64-bit version of the "uncompressed size", often referred to as UncompressedSize64. However, the problematic ZIP files we were getting had both the 32-bit and 64-bit uncompressed sizes set to the same incorrect value.
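In Go, both fields are exposed on zip.FileHeader, which makes the mismatch easy to inspect. A small sketch (the archive name is hypothetical; opening the archive succeeds because only the central directory is parsed at this point):

    package main

    import (
        "archive/zip"
        "fmt"
        "log"
    )

    func main() {
        r, err := zip.OpenReader("ZipFile.ZIP")
        if err != nil {
            log.Fatal(err)
        }
        defer r.Close()
        for _, f := range r.File {
            // For a broken archive, both sizes print the same
            // too-small value for the >4GB file.
            fmt.Println(f.Name, f.UncompressedSize, f.UncompressedSize64)
        }
    }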

It turns out that, of all the decoders we tested, Go’s zip package was the only one that did not ignore the uncompressed size metadata (both the 32-bit and 64-bit versions). Since Go is entirely open source, we found the exact spot in the zip package where the UncompressedSize64 metadata was used.
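The check lives in the checksumReader that wraps every file read: once the decompressor reports EOF, the number of bytes actually produced is compared against the header’s UncompressedSize64. Paraphrased from archive/zip/reader.go (the exact code varies by Go version):

    // Inside checksumReader.Read, after the wrapped reader returns io.EOF:
    if err == io.EOF {
        if r.nread != r.f.UncompressedSize64 {
            // The decompressed byte count does not match the header,
            // so the read fails with the error we were seeing.
            return 0, io.ErrUnexpectedEOF
        }
        // ... CRC32 verification follows ...
    }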

Solution 0: Fix the ZIP files and/or unzip library

We considered asking our clients to use ZIP encoders that correctly set the 64-bit uncompressed size, but that wouldn’t scale as we add more clients. We also considered fixing Go’s archive/zip package upstream; we didn’t pursue that because an upstream change could take months to land and we needed a fix immediately.

Solution 1: Use UNIX unzip

Using UNIX’s unzip tool was an obvious candidate for solving the problem, since that’s how we debugged it in the first place. Unzip doesn’t care about the uncompressed size in the local file header during unzipping.


The worker would read the file from the SFTP server and pass the compressed bytes to UNIX unzip using Go’s exec.Command. Unzip would uncompress the files into temporary files on disk. These temporary files are necessary because some of the uncompressed files could be bigger than the available system RAM. The worker would then gzip and tarball these files before uploading them to S3.
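A minimal sketch of the extraction step, assuming the archive has already been downloaded to a local path and that unzip is on the PATH (the function and package names are hypothetical):

    package worker

    import (
        "fmt"
        "os"
        "os/exec"
    )

    // unzipToTempDir shells out to the system unzip tool, extracting
    // the archive into a fresh temporary directory on disk, since the
    // uncompressed files may not fit in RAM.
    func unzipToTempDir(zipPath string) (string, error) {
        tmpDir, err := os.MkdirTemp("", "unzipped-")
        if err != nil {
            return "", err
        }
        // -o: overwrite without prompting; -d: extraction directory.
        cmd := exec.Command("unzip", "-o", zipPath, "-d", tmpDir)
        if out, err := cmd.CombinedOutput(); err != nil {
            os.RemoveAll(tmpDir)
            return "", fmt.Errorf("unzip %s: %v: %s", zipPath, err, out)
        }
        return tmpDir, nil
    }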

The unzip approach has some serious flaws. Calling an outside tool like unzip adds steps to deployment: we now have to install a specific, working version of unzip on all of our farms as well as on local dev machines. This is painful considering that all of our Go apps compile to single executable binaries. Another issue with the above approach is that we gzip the files before tarballing them; the usual convention is to tar the uncompressed files first and then gzip the tarball.

Solution 2: Fork Go’s archive/zip package

Another obvious solution to our ZIP problem was to simply remove the check that ensures that the number of uncompressed bytes equals the size recorded in the local file header. This does come with its own pitfalls, though. Having a modified fork means that merging upstream changes becomes non-trivial. It also means that we now have two zip packages in our code repo. We addressed this by making the forked/hacked zip an internal package, which limits its scope to the packages in the tree rooted at its parent directory.
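For illustration, the layout might look something like this (the directory and import names are hypothetical):

    workers/
    ├── internal/
    │   └── zip/          the forked archive/zip with the size check removed
    └── parse/
        └── parse.go      imports "ourrepo/workers/internal/zip"

The Go toolchain rejects imports of an internal package from outside the tree rooted at its parent, so nothing beyond workers/ can accidentally depend on the fork.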

Picking the lesser of two evils:

The fork+hack approach is significantly better than the unzip approach because it needs no changes to our deployment process. It is also easier to test, since we don’t have to talk to the file system or an outside UNIX/Linux tool.

Because of the above-mentioned pros, we ended up forking archive/zip to address the inconsistent local file header issue in the ZIP files. It has been almost ten months since the worker switched to the forked ZIP package, and it hasn’t run into the same issue a single time.

Since we continue to receive big ZIP files, some of them most likely have broken local file headers. With the fix in place, though, we can safely use even those ZIP files to power clypd’s platform.
