clef.esgf
Functions for searching the ESGF and matching the results against the MAS database.

esgf_query() performs a query against the ESGF web API.
match_query() performs an outer join of the esgf_query() results against the clef.model.Path table.
find_local_path() and find_missing_id() use the results of match_query() to return the files that are replicated locally and missing from the replica, respectively.

clef.esgf.esgf_query(query, fields, limit=5000, offset=0, distrib=True, replica=False, latest=None, **kwargs)
Search the ESGF
Searches the ESGF using its API. Keyword arguments not listed here are passed on to the API search; each can be either a single value or a list.
Parameters:
- query (str) – Full text query
- fields (list) – Fields to return
- limit (int) – Maximum items to return
- offset (int) – Starting offset of returned items (use with limit for paging)
- distrib (bool) – Distribute the search across all nodes
- replica (bool) – Return replicated datasets
- latest (bool or None) – Return only latest (True), only not latest (False) or all versions (None)
- **kwargs – See the ESGF API docs
Returns: API response from ESGF, decoded from JSON into a Python dict
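
The parameters above could map onto an ESGF search request roughly as follows. This is a minimal sketch, not clef's implementation: the helper name build_search_url, the endpoint URL, and the way booleans are encoded are all assumptions made for illustration. It only builds the request URL and does not contact any server.

```python
from urllib.parse import urlencode

# Assumed endpoint: any ESGF index node exposing the search service would do.
ESGF_SEARCH_URL = "https://esgf-node.llnl.gov/esg-search/search"


def build_search_url(query=None, fields=None, limit=5000, offset=0,
                     distrib=True, replica=False, latest=None, **kwargs):
    """Hypothetical helper: turn esgf_query-style arguments into a search URL."""
    params = {
        "format": "application/solr+json",  # ask for a JSON response
        "limit": limit,
        "offset": offset,
        "distrib": str(distrib).lower(),
        "replica": str(replica).lower(),
    }
    if query is not None:
        params["query"] = query
    if fields is not None:
        params["fields"] = ",".join(fields)
    if latest is not None:
        # None means "all versions", so the constraint is simply left out
        params["latest"] = str(latest).lower()
    # Extra keyword arguments become search facets; doseq=True below expands
    # a list value into one repeated key per element.
    params.update(kwargs)
    return ESGF_SEARCH_URL + "?" + urlencode(params, doseq=True)


url = build_search_url(fields=["id", "checksum"], limit=10,
                       variable=["tas", "pr"], latest=True)
```

Paging through a large result set would then amount to repeating the request with offset increased by limit each time.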

clef.esgf.find_checksum_id(query, **kwargs)
Get checksums and IDs of matching files from ESGF
Searches ESGF using esgf_query(), then converts the response into a SQLAlchemy selectable for further processing.
Parameters: **kwargs – See esgf_query()
Returns: Values table of matching File objects, containing
- checksum
- id
- dataset_id
- title
- version
This table can be joined against the MAS database tables.
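
As an illustration of the first half of that conversion, the sketch below flattens a decoded search response into tuples with the five columns listed above. It assumes a Solr-style layout in which the matches sit under response → docs and in which some fields may arrive as single-element lists; the helper name extract_file_rows and the sample data are made up, and the sketch stops short of building the SQLAlchemy selectable itself.

```python
COLUMNS = ("checksum", "id", "dataset_id", "title", "version")


def _first(value):
    """Unwrap a single-element list; pass scalars through unchanged."""
    if isinstance(value, (list, tuple)):
        return value[0] if value else None
    return value


def extract_file_rows(response):
    """Hypothetical helper: one tuple per matching file, in COLUMNS order."""
    docs = response.get("response", {}).get("docs", [])
    return [tuple(_first(doc.get(col)) for col in COLUMNS) for doc in docs]


# Made-up response, for illustration only
sample = {
    "response": {
        "numFound": 1,
        "docs": [
            {
                "checksum": ["abc123"],
                "id": "example.file.id",
                "dataset_id": "example.dataset.id",
                "title": "example.nc",
                "version": "1",
            }
        ],
    }
}

rows = extract_file_rows(sample)
```

Rows in this shape are what a values table would hold, one row per file, ready to be joined on checksum or title.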

clef.esgf.find_local_path(session, subq, oformat='file')
Find the filesystem paths of ESGF matches
Converts the results of match_query() to local filesystem paths, either to the file itself or to the containing dataset.
Parameters:
- oformat ('file' or 'dataset') – Return the path to the file or the dataset directory
- subq – Result of match_query()
Returns: Iterable of strings with the paths to either files or datasets

clef.esgf.find_missing_id(session, subq, oformat='file')
Returns the ESGF id for each file in the ESGF query that doesn't have a local match
Parameters:
- oformat ('file' or 'dataset') – Return the id of the file or of the dataset
- subq – Result of match_query()
Returns: Iterable of strings with the ESGF file or dataset id

clef.esgf.link_to_esgf(query, **kwargs)
Convert search terms to an ESGF search URL
Returns a link to the user-facing ESGF web search matching a particular query. This is helpful for error messages: users can follow the URL to find the matches as ESGF sees them.
Note that this link is to the ESGF user-facing search page, rather than the web API that esgf_query() uses.
Parameters: **kwargs – As esgf_query()
Returns: URL to the ESGF search website
Return type: str

clef.esgf.match_query(session, query, latest=None, **kwargs)
Match ESGF results against clef.model.Path
Matches the results of find_checksum_id() with the Path table. If latest is True the checksums will be matched; otherwise only the file name is used, in order to spot outdated versions that have been removed from ESGF.
Parameters:
- latest (bool) – Match the checksums (True) or filenames (False)
- **kwargs – See esgf_query()
Returns: Joined result of clef.model.Path and find_checksum_id()
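
The matching step can be pictured with a small, self-contained sketch. It uses the standard-library sqlite3 module rather than SQLAlchemy and the MAS database, and every table name, column name and row is hypothetical. A left outer join keeps every ESGF row, so a row with no local path is a file missing from the local replica, while a row with a path is a file held locally.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with made-up ESGF results and local paths."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE esgf (id TEXT, title TEXT, checksum TEXT);
        CREATE TABLE paths (path TEXT, filename TEXT, checksum TEXT);
    """)
    conn.executemany("INSERT INTO esgf VALUES (?, ?, ?)", [
        ("file.a", "a.nc", "111"),
        ("file.b", "b.nc", "222"),
        ("file.c", "c.nc", "333"),
    ])
    conn.executemany("INSERT INTO paths VALUES (?, ?, ?)", [
        ("/data/a.nc", "a.nc", "111"),
        ("/data/b.nc", "b.nc", "999"),  # same name, different checksum
    ])
    return conn


def match(conn, latest):
    """Outer-join ESGF rows to local paths by checksum (latest) or file name."""
    condition = ("paths.checksum = esgf.checksum" if latest
                 else "paths.filename = esgf.title")
    sql = ("SELECT esgf.id, paths.path FROM esgf "
           "LEFT OUTER JOIN paths ON " + condition + " ORDER BY esgf.id")
    return conn.execute(sql).fetchall()


conn = build_demo_db()
by_checksum = match(conn, latest=True)
by_name = match(conn, latest=False)

# Rows with a path are held locally; rows without one are missing.
local_paths = [path for _, path in by_checksum if path is not None]
missing_ids = [esgf_id for esgf_id, path in by_checksum if path is None]
```

In this made-up data, matching by file name also pairs file.b with /data/b.nc even though the checksums differ, which is the kind of outdated local copy that name-based matching is meant to reveal.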