Writing Files From GridFS To An Archive Without Writing To Disk First

I came across a situation where I needed to generate a bz2-compressed archive of a bunch of files extracted from GridFS. This process is going to run regularly, so I had to take the performance impact on the server into consideration. I felt it would be best to take the data as it is extracted from GridFS and write it directly to the compressed archive, instead of writing each file to disk first and then adding it to the archive.

Python has a library for generating tarballs called tarfile. This was very useful, since it can write files directly to a bz2-compressed archive. The issue I ran into is that tarfile expects each archive member to look like a file: a file-like object to read from, plus file attributes such as a name and a size. The data coming out of GridFS had to be presented that way. Using ordinary file IO would force me to write to disk, and that’s not what I wanted to do.

Luckily there is StringIO. Using this in combination with tarfile’s TarInfo object, this became a very easy task to accomplish:

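import tarfile
import time
from StringIO import StringIO

# gridfs_files is assumed to be an iterable of objects with .name and .data
tar = tarfile.open("sometarfile.tar.bz2", "w:bz2")
for gridfs_file in gridfs_files:
    info = tarfile.TarInfo(name="%s" % gridfs_file.name)
    info.mtime = time.time()  # TarInfo defaults mtime to 0 (the epoch)
    info.size = len(gridfs_file.data)
    # Wrap the in-memory data in a file-like object that tarfile can read
    tar.addfile(info, StringIO(gridfs_file.data))
tar.close()  # write the end-of-archive blocks and flush the bz2 stream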

The basic idea is that you use TarInfo to specify the filename, size, modified time, and so on. Setting the modified time is important: it defaults to 0, the Unix epoch, and tar will complain about the implausibly old date. You use StringIO to turn your data into a file-like object that tarfile will accept, and you pass the two to addfile to add the file to the archive. This works really well, except for one issue that I am still working on: if you bunzip2 the compressed archive and then attempt a tar -t on it, it hangs and does nothing. It’s possible that GNU tar has a problem, or that the way tarfile is creating the file isn’t correct, but the archive does decompress and extract properly, which is the important part!
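For anyone on Python 3, where the StringIO module no longer exists, here is a minimal sketch of the same technique using io.BytesIO. It is not the code from my setup: build_archive and sample_files are hypothetical names, and the (name, bytes) pairs stand in for whatever you pull out of GridFS. The archive itself is written to an in-memory buffer here only so the example can read it back and check the contents.

```python
import io
import tarfile
import time


def build_archive(files, fileobj):
    """Write (name, bytes) pairs into a bz2-compressed tar on fileobj."""
    # The with block closes the TarFile, which finalizes the archive.
    with tarfile.open(fileobj=fileobj, mode="w:bz2") as tar:
        for name, data in files:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)          # must match the bytes supplied
            info.mtime = int(time.time())  # otherwise defaults to 0
            tar.addfile(info, io.BytesIO(data))


# Hypothetical stand-in for data extracted from GridFS.
sample_files = [("a.txt", b"hello"), ("b.txt", b"world!")]

buffer = io.BytesIO()
build_archive(sample_files, buffer)

# Read the archive back from memory to confirm it lists and extracts.
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r:bz2") as tar:
    names = tar.getnames()
    first = tar.extractfile("a.txt").read()
```

To write the archive to disk as in the snippet above, open it with a filename such as tarfile.open("sometarfile.tar.bz2", "w:bz2") instead of passing fileobj.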
