One of the most important things to define in any organized software project is its scope. To do that, the exact syntax, or API, has to be specified top-down. This post documents my initial conception of how the Fetcher module should be structured, and how long I expect each feature to take to implement fully within Google Summer of Code.

Fetcher Code Files

In my proposal, there would be two code files within the Fetch module: convenience.py (I might shorten or rename it) and classes.py. classes.py would be a sub-submodule containing new web retrieval code that works generically on any web resource; namely, it would contain the new Fetcher classes described below. convenience.py would contain wrapper functions that call the new Fetcher classes in a particular way to query a particular database.

The underlying philosophy behind this approach is to give the user two ways to control how web requests are structured: a messy “sawed-off shotgun” approach for known database targets, or a more precise, scalpel-like approach for new, unknown web queries. The latter would be supported by the Fetcher classes, while the former would be handled by the convenience/wrapper functions, which call the versatile Fetcher classes.

Fetcher Classes

The Fetcher classes are a set of classes, following the paradigms of object-oriented programming, that encapsulate the action of fetching a file from the web. There would be three Fetcher classes. BaseFetcher will be an abstract base class with two children: StaticFetcher and DynamicFetcher. The former represents a fetcher that downloads the file and caches it on disk; the latter streams directly from the database, forgoing the cache. Both fetchers would store the data as a file-like object that can be passed into a Universe for processing.

The BaseFetcher ABC will primarily contain a base_url attribute, to which its children will send an HTTP GET request (per a RESTful API). It would also define all the exceptions that can arise from such requests.

from abc import ABC, abstractmethod

class BaseFetcher(ABC):
    # Common attributes for all fetchers
    def __init__(self, base_url, progressbar):
        self.base_url = base_url
        self.progressbar = progressbar

    @abstractmethod
    def get_file(self):
        # Starts file retrieval workflow
        # meant to be overridden by children
        ...

    def _http_get(self):
        # Sends an HTTP GET request to base_url
        ...
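As a minimal sketch of what _http_get could do, here is a standalone function built only on the standard library's urllib.request. The names http_get and FetchError, the timeout, and the chunk size are all assumptions made for illustration; the final module may well use a different HTTP client and a richer exception hierarchy.

```python
import io
import urllib.request


class FetchError(Exception):
    """Hypothetical exception raised when a retrieval fails."""


def http_get(url, timeout=30, chunk_size=8192):
    """Send a GET request to `url` and return the body as a file-like object."""
    buffer = io.BytesIO()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Read in chunks so a progress bar could be updated per chunk.
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                buffer.write(chunk)
    except (OSError, ValueError) as err:
        # URLError and HTTPError are OSError subclasses; ValueError covers
        # strings that are not valid URLs at all.
        raise FetchError(f"could not retrieve {url!r}: {err}") from err
    buffer.seek(0)  # rewind so the caller can read from the start
    return buffer
```

Wrapping every failure in a single exception type keeps the children simple: they only need to catch FetchError, whatever went wrong underneath.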

The StaticFetcher will contain private methods that extend BaseFetcher to allow caching on disk, invisibly to the user. The cache location would be specified by a cache_path parameter, whose default is currently set by the module variable DEFAULT_CACHE_NAME_DOWNLOADER in the current version of the Fetch module. Per my pre-proposal on GitHub Discussions, there would be two ways to save the hashes: a plain-text CSV file or a database file. Both are supported by the Python standard library, via the csv and sqlite3 modules respectively.

class StaticFetcher(BaseFetcher):
    # Class which downloads and caches the file to disk

    def __init__(self, base_url, cache_path=None, progressbar=False):
        super().__init__(base_url, progressbar)
        # Where downloaded files and their hashes are kept
        self.cache_path = cache_path

    def get_file(self):
        # Starts file retrieval workflow
        ...

    def _save_hash(self):
        # Creates/updates the hash file (either a csv or database file)
        ...

    def _load_hash(self):
        # Checks and loads the hash (either a csv or database file)
        ...
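To make the database option concrete, here is a rough sketch of hash bookkeeping with sqlite3 and hashlib. The table layout, the choice of SHA-256, and the function names (file_sha256, save_hash, load_hash, is_cached) are assumptions for illustration, not a settled design.

```python
import hashlib
import os
import sqlite3
from contextlib import closing


def file_sha256(path, chunk_size=65536):
    """Hash a file in chunks so large trajectories never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def _connect(db_path):
    # Hypothetical schema: one row per URL, holding the expected hash.
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hashes "
        "(url TEXT PRIMARY KEY, sha256 TEXT NOT NULL)"
    )
    return conn


def save_hash(db_path, url, sha256):
    with closing(_connect(db_path)) as conn:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT OR REPLACE INTO hashes (url, sha256) VALUES (?, ?)",
                (url, sha256),
            )


def load_hash(db_path, url):
    with closing(_connect(db_path)) as conn:
        row = conn.execute(
            "SELECT sha256 FROM hashes WHERE url = ?", (url,)
        ).fetchone()
    return row[0] if row else None


def is_cached(db_path, url, file_path):
    """True only if the file exists and still matches its recorded hash."""
    expected = load_hash(db_path, url)
    return (
        expected is not None
        and os.path.exists(file_path)
        and file_sha256(file_path) == expected
    )
```

A CSV file would work the same way from the caller's point of view; the database mainly saves rewriting the whole file whenever one entry changes.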

The DynamicFetcher will contain private methods that extend BaseFetcher to stream the web resource directly into memory, without caching it on disk. This would involve invisibly creating a buffer and managing memory intelligently to prevent excess resource use. Unlike with StaticFetcher, there is an interesting problem in integrating this with the existing IMDReader. In principle, it should be possible to return an iterator over the trajectory, given the trajectory's base_url, using IMDReader. This means it would be possible to stream the trajectory and analyze it within the same script. Currently, the most straightforward way to make this work would be to create an IMDWriter that acts as a pseudo-MD engine, writing the trajectory into memory.

class DynamicFetcher(BaseFetcher):
    # Class which directly streams into memory

    def __init__(self, base_url, progressbar=False):
        super().__init__(base_url, progressbar)

    def get_file(self):
        # Starts file retrieval workflow
        ...

    def _bind_to_socket(self, socket):
        # Binds to socket for IMD Protocol
        ...
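The buffering half of this can be sketched independently of the IMD question. The function below copies any file-like source into an in-memory buffer in chunks and refuses to grow past a limit. The names stream_into_memory and ResourceTooLarge and the default limit are assumptions for illustration, and nothing here touches IMDReader or the IMD protocol.

```python
import io


class ResourceTooLarge(Exception):
    """Hypothetical exception raised when a download outgrows the memory budget."""


def stream_into_memory(source, max_bytes=256 * 1024 * 1024, chunk_size=8192):
    """Copy a file-like `source` into a BytesIO without exceeding `max_bytes`."""
    buffer = io.BytesIO()
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > max_bytes:
            # Stop early instead of letting the buffer grow without bound.
            raise ResourceTooLarge(
                f"resource exceeds the {max_bytes}-byte in-memory limit"
            )
        buffer.write(chunk)
    buffer.seek(0)  # rewind so the caller reads from the beginning
    return buffer
```

Any object with a read() method works as the source, so the same loop would serve an HTTP response or a socket wrapper.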
    

Convenience Functions

This suite would contain utility functions that configure the Fetcher classes for particular databases.

For example, the current fetch.pdb.from_pdb() would be implemented as follows:


import MDAnalysis as mda

def from_pdb(pdb_ids,
    cache_path=None,
    progressbar=False,
    file_format="cif.gz",
    in_memory=False # New boolean!
    ):

    universes = []

    # Preprocessing
    for pdb in pdb_ids:
        # Placeholder: the database's download URL would be prepended here
        proper_url = f"{pdb}.{file_format}"

        # Actual code
        if in_memory:
            fetcher = DynamicFetcher(proper_url, progressbar)
        else:
            fetcher = StaticFetcher(proper_url, cache_path, progressbar)
        universes.append(mda.Universe(fetcher))

    # One Universe per requested ID
    return universes
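The preprocessing step could look roughly like the helper below, which turns PDB IDs into download URLs. The helper name build_pdb_urls is hypothetical, and the base URL is my assumption about the RCSB file server layout; it should be checked against the database's documentation before being relied on.

```python
# Assumed download endpoint; verify against the RCSB documentation.
RCSB_DOWNLOAD_URL = "https://files.rcsb.org/download"


def build_pdb_urls(pdb_ids, file_format="cif.gz"):
    """Return one download URL per PDB ID, in the order given."""
    if isinstance(pdb_ids, str):
        # Allow a single ID to be passed without wrapping it in a list.
        pdb_ids = [pdb_ids]
    return [
        f"{RCSB_DOWNLOAD_URL}/{pdb_id.strip()}.{file_format}"
        for pdb_id in pdb_ids
    ]
```

Keeping URL construction in its own function means each new database only needs a new builder, while the fetcher classes stay untouched.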

A list of the databases that could be supported will follow in a future blog post.


