I came across a situation where I needed to generate a bz2-compressed archive of a bunch of files extracted from GridFS. This process is going to run regularly, so I had to consider the performance hit on the server. I felt it would be best to take the data as it is extracted from GridFS and write it directly to the compressed archive, instead of writing each file to disk first and then adding it to the archive.
Python has a standard library module for reading and writing tarballs called tarfile. This was very useful, since it can write files directly to a bz2-compressed archive. The issue I ran into is that for the data coming out of GridFS to be written to the archive, it had to be presented as if it were a file (with file attributes). Using ordinary file IO would force me to write to disk, and that’s not what I wanted to do.
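# Python 2 (on Python 3, use io.BytesIO in place of StringIO)
import tarfile
import time
from StringIO import StringIO

# Open a bz2-compressed tarball for writing
tar = tarfile.open("sometarfile.tar.bz2", "w:bz2")
for file in gridfs_files:
    # Describe the archive member: name, modification time, size
    info = tarfile.TarInfo(name="%s" % file.name)
    info.mtime = time.time()
    info.size = len(file.data)
    # Wrap the in-memory data in a file-like object and add it
    tar.addfile(info, StringIO(file.data))
tar.close()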
The basic idea is that you use TarInfo to specify the filename, size, modified time, and so on. Setting the modified time matters: TarInfo defaults it to 0 (the Unix epoch), and tar will complain about the implausibly old date when extracting. You use StringIO to turn your data into a file-like object that tarfile will accept, and you pass the two to addfile to add the file to the archive. This works really well, except for one issue that I am still working on. If you bunzip2 the compressed archive and then attempt to run tar -t on it, it hangs and does nothing. It’s possible that GNU tar has a problem, or that tarfile isn’t creating the file correctly, but the archive does decompress and extract properly, which is the important part!
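For anyone on Python 3, here is a minimal sketch of the same TarInfo-plus-in-memory-file idea, using io.BytesIO in place of StringIO. The helper names write_tar_bz2 and list_tar_bz2 and the sample data are hypothetical stand-ins (the real data would come from GridFS), and the archive is written to an in-memory buffer only so the example is self-contained. It also reads the archive back with tarfile itself, which is one way to check the member listing without going through the tar command.

```python
import io
import tarfile
import time


def write_tar_bz2(entries, out):
    """Write (name, data) pairs to `out` as a bz2-compressed tarball.

    `entries` is an iterable of (str, bytes) pairs and `out` is a writable
    binary file object. Both are illustrative assumptions, not GridFS API.
    """
    with tarfile.open(fileobj=out, mode="w:bz2") as tar:
        for name, data in entries:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)          # must match the bytes supplied
            info.mtime = int(time.time())  # otherwise mtime stays at 0 (the epoch)
            tar.addfile(info, io.BytesIO(data))


def list_tar_bz2(raw):
    """Return (name, size) for each member of an archive held in memory."""
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:bz2") as tar:
        return [(member.name, member.size) for member in tar.getmembers()]


# Stand-in data; in the post this would be the files pulled from GridFS.
sample = [("a.txt", b"hello"), ("dir/b.bin", b"\x00\x01\x02")]
buf = io.BytesIO()
write_tar_bz2(sample, buf)
listing = list_tar_bz2(buf.getvalue())
```

To write to disk instead, open a real file in binary mode (for example open("sometarfile.tar.bz2", "wb")) and pass it as out.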