Saturday, March 15, 2008

stored_result(): automatic serialization of expensive calculations

A very common situation I encounter: "Do [expensive calculation] and pickle the result; unless it's been done before, in which case unpickle the stored result." This stored_result() utility function makes that very easy, and provides some convenient options for scripting and memory management."


def stored_results(pkl_filename, function, overwrite=0, verbose=1,
          lazy=False, args=[], kwargs={}):
    """Returns contents of pickled file 'pkl_filename'. If file does not exist
    or is malformed, returns output of function (with args & kwargs) and pickles same
    result to pkl_filename.
    If overwrite>0, runs function regardless of presence of 'pkl_filename'.
    If lazy=True, instead outputs a generator that evaluates the rest of the stored_results()
    call only when its next() method is called."""
    import cPickle
    if lazy:
        # Create a generator that will eventually retrieve stored result
        def result_gen():
            result = stored_results(pkl_filename, function, overwrite, verbose,
                                    False, args, kwargs)
            while 1:
                yield result
        return result_gen()
    else:
        if not overwrite>0:
            try:
                inf = file(pkl_filename, 'rb')
                result = cPickle.load(inf)
                inf.close()
                if verbose: print pkl_filename, "successfully unpickled"
                return result
            except (cPickle.UnpicklingError, IOError):
                if verbose: print "Result not stored. Running function", function.func_name
        result = function(*args, **kwargs)
        outf = file(pkl_filename, 'wb')
        cPickle.dump(result, outf, -1)
        outf.close()
        if verbose: print "Result pickled to", pkl_filename
        return result


Typical use:
result = stored_results('saved_result.pkl', expensive_function, args=(23, 'sweet'))

Advanced use:
To perform the calculation/load the result in a just-in-time fashion:
result_gen = stored_results('saved_result.pkl', expensive_function, args=(23, 'sweet'), lazy=True)
[memory-sensitive tasks that may or may not need the result]
result = result_gen.next()

Victory is mine!

I defended and graduated! PhD in Physics from MIT. Now it gets good.

I crossed the street to the Broad Institute to do cancer research. Now, I don't interact with patients, or even deal with tumor samples. My job is to manage and analyze the huge flow of data coming from the latest microarray experiments. We look at sequence, expression, methylation, copy-number variation, and microRNA data simultaneously to tease out the subtle cellular changes that result in cancer.

When motivation strikes, I'll write about some of the programming and other issues I deal with at work.