CleF - Climate Finder search ESGF data at NCI
README
Clef searches the Earth System Grid Federation (ESGF) datasets stored at the Australian National Computational Infrastructure (NCI): both data published on the NCI ESGF node and files that are locally replicated from other ESGF nodes.
Currently it searches for the following datasets:
- CMIP5 raijin projects: rr3, where NCI is the primary publisher, and al33 for replicas
- CMIP6 raijin projects: oi10 for replicas
The search returns both the paths of data already available at NCI and information on data that is on external ESGF nodes but not yet available locally.
Install
Clef is pre-installed into a Conda environment at NCI. Load it with:
module use /g/data3/hh5/public/modules
module load conda/analysis3-unstable
We are constantly adding new features; the development version is available in a separate environment:
module use /g/data3/hh5/public/modules
module load conda
source activate clef-test
You can install it to your own environment with:
conda install -c coecms -c conda-forge clef
Note, however, that the MAS database necessary for running clef can only be accessed from NCI systems.
Use
clef cmip5
Find CMIP5 files matching the constraints:
clef cmip5 --model BCC-CSM1.1 --variable tas --experiment historical --table day
You can filter CMIP5 by the following terms:
- ensemble/member
- experiment
- experiment-family
- institution
- model
- table/cmor_table
- realm
- frequency
- variable
- cf-standard-name
See clef cmip5 --help for all available filters and their aliases.
The --latest flag checks the latest versions of the datasets on the ESGF website and returns only the matching files.
The search returns a path for every file available locally at NCI and a dataset-id for those that haven’t been downloaded yet.
You can use the flags --local and --missing to return, respectively, only the local paths or only the missing dataset-ids:
clef --local cmip5 --model MPI-ESM-LR --variable tas --table day
clef --missing cmip5 --model MPI-ESM-LR --variable tas --table day
NB: these flags come immediately after the command “clef” and before the sub-command “cmip5” or “cmip6”, and they are mutually exclusive. You can pass the same argument more than once to search for several values:
clef --missing cmip5 --model MPI-ESM-LR -v tas -v tasmax -t day -t Amon
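When clef is driven from a script, the ordering rules above can be captured in a small helper. The following is only a sketch: `build_clef_command` and `MODE_FLAGS` are hypothetical names that are not part of clef, and the helper only knows about the two flags described so far (--local and --missing). It builds the argument list without running anything.

```python
import shlex

# The two mode flags described above; they go between "clef" and the sub-command.
MODE_FLAGS = {"local", "missing"}


def build_clef_command(project, constraints, mode=None):
    """Return the argv list for a clef search (hypothetical helper, not part of clef)."""
    if mode is not None and mode not in MODE_FLAGS:
        raise ValueError(f"mode must be one of {sorted(MODE_FLAGS)} or None")
    argv = ["clef"]
    if mode is not None:
        # Only one mode flag can be given, and it precedes the sub-command.
        argv.append(f"--{mode}")
    argv.append(project)
    for name, values in constraints.items():
        # A bare string is a single value; a list repeats the flag once per value.
        if isinstance(values, str):
            values = [values]
        for value in values:
            argv.extend([f"--{name}", value])
    return argv


cmd = build_clef_command(
    "cmip5",
    {"model": "MPI-ESM-LR", "variable": ["tas", "tasmax"], "table": ["day", "Amon"]},
    mode="missing",
)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run` on a system where clef is installed; `shlex.join` is used here only to display the command as it would be typed.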
clef cmip6
You can filter CMIP6 by the following terms:
- activity
- experiment
- institution
- source_type
- model
- member
- table
- realm
- frequency
- variable
- version
See clef cmip6 --help for all available filters.
Develop
Development install:
conda env create -f conda/dev-environment.yml
source activate clef-dev
pip install -e '.[dev]'
The dev-environment.yml file is for speeding up installs and installing packages unavailable on PyPI; requirements.txt is the source of truth for dependencies.
To work on the database tables you may need to start up a test database.
You can start a test database either with Docker:
docker-compose up # (In a separate terminal)
psql -h localhost -U postgres -f db/nci.sql
psql -h localhost -U postgres -f db/tables.sql
# ... do testing
docker-compose rm
Or with Vagrant:
vagrant up
# ... do testing
vagrant destroy
Run tests with py.test (they will default to using the test database):
py.test
Build the documentation using Sphinx:
python setup.py build_sphinx
firefox docs/_build/index.html
New releases are packaged and uploaded to anaconda.org by CircleCI when a new GitHub release is made.
Documentation is available on ReadTheDocs, both for stable and latest versions.
Getting Started
CleF is presently installed in an Anaconda environment, which must be loaded before use (on either VDI or Raijin):
$ module use /g/data3/hh5/public/modules
$ module load conda/analysis3-unstable
NB: a clef version is also available in analysis3, but the one in analysis3-unstable is more recent and has fixes for some bugs.
clef is accessed through the command-line clef program. There are presently three main commands:
- clef cmip5 to execute searches on the CMIP5 dataset
- clef cmip6 to execute searches on the CMIP6 dataset
- clef ds to execute searches on non-ESGF climate datasets
Examples
The search works like the ESGF search website, e.g. https://esgf.nci.org.au/search/esgf_nci. Results can be filtered by using flags matching the ESGF search facets.
CMIP5
$ clef cmip5 --model ACCESS1.0 \
--experiment historical \
--frequency mon \
--variable ua \
--variable va
/g/data1/rr3/publications/CMIP5/output1/CSIRO-BOM/ACCESS1-0/historical/mon/atmos/Amon/r1i1p1/v20120727/ua/
/g/data1/rr3/publications/CMIP5/output1/CSIRO-BOM/ACCESS1-0/historical/mon/atmos/Amon/r1i1p1/v20120727/va/
/g/data1/rr3/publications/CMIP5/output1/CSIRO-BOM/ACCESS1-0/historical/mon/atmos/Amon/r2i1p1/v20130726/ua/
/g/data1/rr3/publications/CMIP5/output1/CSIRO-BOM/ACCESS1-0/historical/mon/atmos/Amon/r2i1p1/v20130726/va/
/g/data1/rr3/publications/CMIP5/output1/CSIRO-BOM/ACCESS1-0/historical/mon/atmos/Amon/r3i1p1/v20140402/ua/
/g/data1/rr3/publications/CMIP5/output1/CSIRO-BOM/ACCESS1-0/historical/mon/atmos/Amon/r3i1p1/v20140402/va/
Everything available on ESGF is also available locally
CMIP6
$ clef cmip6 --activity CMIP \
--experiment historical \
--source_type AOGCM \
--table Amon \
--grid gr \
--resolution "250 km" \
--variable ua \
--variable va
/g/data1b/oi10/replicas/CMIP6/CMIP/CNRM-CERFACS/CNRM-CM6-1/historical/r1i1p1f2/Amon/ua/gr/v20180917/
/g/data1b/oi10/replicas/CMIP6/CMIP/CNRM-CERFACS/CNRM-CM6-1/historical/r1i1p1f2/Amon/va/gr/v20180917/
Available on ESGF but not locally:
CMIP6.CMIP.CNRM-CERFACS.CNRM-CM6-1.historical.r2i1p1f2.Amon.ua.gr.v20181126
CMIP6.CMIP.CNRM-CERFACS.CNRM-CM6-1.historical.r2i1p1f2.Amon.va.gr.v20181126
ds
$ clef ds -f netcdf --standard-name air_temperature
ta: /g/data/ub4/erai/netcdf/6hr/atmos/oper_an_pl/1.0/ta/ta_6hr_ERAI_historical_oper_an_pl_<YYYYMMDD>_<YYYYMMDD>.nc
tas: /g/data/ub4/erai/netcdf/6hr/atmos/oper_an_sfc/1.0/tas/tas_6hr_ERAI_historical_oper_an_sfc_<YYYYMMDD>_<YYYYMMDD>.nc
ta: /g/data/ub4/erai/netcdf/6hr/atmos/oper_an_ml/1.0/ta/ta_6hr_ERAI_historical_oper_an_ml_<YYYYMMDD>_<YYYYMMDD>.nc
mn2t: /g/data/ub4/erai/netcdf/3hr/atmos/oper_fc_sfc/1.0/mn2t/mn2t_3hr_ERAI_historical_oper_fc_sfc_<YYYYMMDD>_<YYYYMMDD>.nc
mx2t: /g/data/ub4/erai/netcdf/3hr/atmos/oper_fc_sfc/1.0/mx2t/mx2t_3hr_ERAI_historical_oper_fc_sfc_<YYYYMMDD>_<YYYYMMDD>.nc
tas: /g/data/ub4/erai/netcdf/3hr/atmos/oper_fc_sfc/1.0/tas/tas_3hr_ERAI_historical_oper_fc_sfc_<YYYYMMDD>_<YYYYMMDD>.nc
ESGF command line search
Four optional flags are available for the cmip5 and cmip6 commands to change the output or submit a data request:
- clef --remote cmip5 returns all the ESGF CMIP5/CMIP6 datasets matching the constraints; it is equivalent to running a search online on an ESGF node
- clef --local cmip5 finds local files by querying NCI’s MAS database directly, so it also returns older versions or datasets that might be temporarily offline
- clef --missing cmip6 finds files on ESGF that haven’t been downloaded to NCI
- clef --request cmip6 creates a request to download the missing files and passes it to NCI
If these flags are omitted, the tool searches ESGF for datasets matching the constraints and returns both the local and the missing file lists, based on that ESGF node search.
The search works like the ESGF search website, e.g. https://esgf.nci.org.au/search/esgf_nci. Results can be filtered by using flags matching the ESGF search facets:
$ clef cmip5 --model ACCESS1.0 \
--experiment historical \
--frequency mon \
--variable ua \
--variable va
If the same flag is used multiple times, all the given terms will be searched for.
Please note that CMIP5 and CMIP6 differ in the names and number of their flags, although we tried to use the same names wherever possible. In particular, CMIP6 has some new flags available:
$ clef cmip6 --activity CMIP \
--experiment historical \
--source_type AOGCM \
--table Amon \
--grid gr \
--resolution "250 km" \
--variable ua \
--variable va
`activity` - MIPs or sub-projects; for example, CMIP refers to the DECK group of experiments
`source_type` - model type; in the example above, AOGCM is a coupled Atmosphere-Ocean Global Climate Model
`grid` - grid kind; in the example, 'gr' stands for "regridded data reported on the data provider's preferred target grid"
`resolution` - nominal resolution of the grid; there are two kinds of nominal resolution.
If the value is in degrees then this is a standard CMIP6 grid; currently only "1x1 degree" is available.
If the resolution is in km then this is an approximate resolution. Details are available in Appendix 2 of the CMIP6 attributes documentation: https://goo.gl/v1drZl
Note that a resolution value is always composed of two separate words and must be passed as a string enclosed in quotes "".
When querying the ESGF website, the total number of results is limited to 5,000 files. If clef finds more results it will ask you to refine your query. You can follow the link to see the query clef used on the ESGF website:
$ clef cmip5
Exception: Too many results (1030069), try limiting your search
https://esgf.nci.org.au/search/esgf_nci?query=&distrib=on&latest=on&project=CMIP5
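The link in the error message is an ordinary query string, so an equivalent URL can be assembled with the standard library. This is a minimal sketch and not clef's own code: `esgf_search_link` is a hypothetical helper, the default parameters are copied from the link shown above, and passing extra facets such as `model` as URL parameters is an assumption about how the search page behaves.

```python
from urllib.parse import urlencode

# Base of the user-facing search page, taken from the example link above.
SEARCH_PAGE = "https://esgf.nci.org.au/search/esgf_nci"


def esgf_search_link(project, **facets):
    """Build a search-page URL (illustrative only, not clef's own code)."""
    params = {"query": "", "distrib": "on", "latest": "on", "project": project}
    params.update(facets)
    # doseq=True repeats a key once per value when a facet holds a list.
    return f"{SEARCH_PAGE}?{urlencode(params, doseq=True)}"


print(esgf_search_link("CMIP5"))
print(esgf_search_link("CMIP5", model="ACCESS1.0", variable=["ua", "va"]))
```

With no facets, the helper reproduces the link printed with the "Too many results" exception; adding facets narrows the same query.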
Options
clef --missing
clef --missing <dataset> searches ESGF for files that haven’t been downloaded to NCI. It returns ESGF dataset IDs for each dataset that has one or more missing files:
$ clef --missing cmip5 --model HadCM3 --experiment historical \
--table day --ensemble r1i1p1 \
--variable ta
Available on ESGF but not locally:
cmip5.output2.MOHC.HadCM3.historical.day.atmos.day.r1i1p1.v20110728
NOTE: ESGF keeps track of only the most recent versions of each file for a given dataset version, so if the files in the NCI mirror and ESGF don’t match this command can return false positives.
clef --local
clef --local <dataset> searches the local file system for files that have been downloaded to NCI. It returns the path to the file on NCI’s /g/data disk:
$ clef --local cmip5 --model HadCM3 --experiment historical --table day --ensemble r1i1p1 \
--variable ta --all-versions
/g/data1/ua6/unofficial-ESG-replica/tmp/tree/cmip-dn1.badc.rl.ac.uk/thredds/fileServer/esg_dataroot/cmip5/output1/MOHC/HadCM3/historical/day/atmos/day/r1i1p1/v20110728/ta/
/g/data1/ua6/unofficial-ESG-replica/tmp/tree/esgf-data1.ceda.ac.uk/thredds/fileServer/esg_dataroot/cmip5/output1/MOHC/HadCM3/historical/day/atmos/day/r1i1p1/v20140110/ta/
NOTE: Presently the default behaviour for all the ESGF-node based searches is to check for the most recent (latest) version on ESGF, and return only files with that version. This can be disabled with the --all-versions flag.
The --local option instead currently returns all available versions by default, including versions unpublished by the ESGF that are still available locally.
Most of the older CMIP5 collection (ua6 project) has been replaced by the new one (al33 project), which does not include older or superseded versions.
If you are looking for one of these versions you could try using the ARCCSSive module https://github.com/coecms/arccssive to locate it, or ask the helpdesk.
Tips
If your search does not return any results, try again later. The tool searches the ESGF website first, and sometimes one or more nodes are disconnected, so the returned results are incomplete. Try the --local flag to at least get what is available locally. For CMIP5 you can use the older ARCCSSive tool if in doubt.
Climate collections command line search
The ds command is a new feature of clef and we are still defining its behaviour. clef ds with no other argument will return a list of the local datasets available in the database. NB: this is not an exhaustive list of the climate collections at NCI, and not all the datasets already in the database have been completed:
$ clef ds --help
Usage: clef ds [OPTIONS]
Search local database for non-ESGF datasets
Options:
-d, --dataset TEXT Dataset name
-v, --version TEXT Dataset version
-f, --format [netcdf|grib|HDF5|binary]
Dataset file format as defined in clef.db
Dataset table
-sn, --standard-name [air_temperature|air_pressure|rainfall_rate]
Variable standard_name this is the most
reliable way to look for a variable across
datasets
-cn, --cmor-name [ps|pres|psl|tas|ta|pr|tos]
Variable cmor_name useful to look for a
variable across datasets
-va, --variable [T|U|V|Z] Variable name as defined in files: tas, pr,
sic, T ...
--frequency [yr|mon|day|6hr|3hr|1hr]
Time frequency on which variable is defined
--from-date TEXT To define a time range of availability of a
variable, can be used on its own or
together with to-date. Format is YYYYMMDD
--to-date TEXT To define a time range of availability of a
variable,
can be used on its own or
together with from-date. Format is YYYYMMDD
--help Show this message and exit.
This shows the available arguments. If you specify any of the variable options then the search will return a list of variables rather than datasets. Since variables can be named differently among datasets, using the standard_name or cmor_name options to identify them, if available, is the best option.
Examples
$ clef ds -f netcdf --standard-name air_temperature
ta: /g/data/ub4/erai/netcdf/6hr/atmos/oper_an_pl/1.0/ta/ta_6hr_ERAI_historical_oper_an_pl_<YYYYMMDD>_<YYYYMMDD>.nc
tas: /g/data/ub4/erai/netcdf/6hr/atmos/oper_an_sfc/1.0/tas/tas_6hr_ERAI_historical_oper_an_sfc_<YYYYMMDD>_<YYYYMMDD>.nc
ta: /g/data/ub4/erai/netcdf/6hr/atmos/oper_an_ml/1.0/ta/ta_6hr_ERAI_historical_oper_an_ml_<YYYYMMDD>_<YYYYMMDD>.nc
mn2t: /g/data/ub4/erai/netcdf/3hr/atmos/oper_fc_sfc/1.0/mn2t/mn2t_3hr_ERAI_historical_oper_fc_sfc_<YYYYMMDD>_<YYYYMMDD>.nc
mx2t: /g/data/ub4/erai/netcdf/3hr/atmos/oper_fc_sfc/1.0/mx2t/mx2t_3hr_ERAI_historical_oper_fc_sfc_<YYYYMMDD>_<YYYYMMDD>.nc
tas: /g/data/ub4/erai/netcdf/3hr/atmos/oper_fc_sfc/1.0/tas/tas_3hr_ERAI_historical_oper_fc_sfc_<YYYYMMDD>_<YYYYMMDD>.nc
This returns all the variables available as netcdf files with air_temperature as standard_name. NB: for each variable a path structure is returned.
$ clef ds -f netcdf --cmor-name ta
ta: /g/data/ub4/erai/netcdf/6hr/atmos/oper_an_pl/1.0/ta/ta_6hr_ERAI_historical_oper_an_pl_<YYYYMMDD>_<YYYYMMDD>.nc
ta: /g/data/ub4/erai/netcdf/6hr/atmos/oper_an_ml/1.0/ta/ta_6hr_ERAI_historical_oper_an_ml_<YYYYMMDD>_<YYYYMMDD>.nc
This returns a subset of the previous search using the cmor_name to clearly identify one kind of air_temperature.
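Because clef ds returns a path structure rather than actual file names, a script has to expand the `<YYYYMMDD>` placeholders before it can list files. One possible approach, shown here as a sketch, is to turn the structure into a glob pattern. `template_to_glob` is a hypothetical helper (not part of clef), and the candidate file name below is made up purely to exercise the pattern.

```python
import fnmatch

# Placeholder used in the path structures printed by `clef ds`.
DATE_TOKEN = "<YYYYMMDD>"


def template_to_glob(template):
    """Replace each date placeholder with eight single-digit wildcards."""
    return template.replace(DATE_TOKEN, "[0-9]" * 8)


template = (
    "/g/data/ub4/erai/netcdf/6hr/atmos/oper_an_pl/1.0/ta/"
    "ta_6hr_ERAI_historical_oper_an_pl_<YYYYMMDD>_<YYYYMMDD>.nc"
)
pattern = template_to_glob(template)

# A made-up file name, used only to check that the pattern matches.
candidate = template.replace(DATE_TOKEN, "19790101", 1).replace(DATE_TOKEN, "19790131", 1)
print(fnmatch.fnmatchcase(candidate, pattern))
```

On an NCI system the same pattern could be passed to `glob.glob` to list the files that actually exist.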
Integrating the ESGF search in your code
The code sub-module contains the functions used to run the --local option; they can be used to integrate this query in your own Python scripts:
from clef.code import *
After importing them you need to open a connection with the NCI MAS database to be able to run your queries:
db = connect()
s = Session()
The search function takes 3 inputs: the db session, the project (currently ‘cmip5’ or ‘cmip6’) and a dictionary containing the query constraints:
results = search(s, project='cmip5', **constraints)
The keys available to define your constraints depend on the project you are querying and the attributes stored by the database. You can use any of the facets used for ESGF but in future we will be adding other options based on extra fields which are stored as attributes.
Examples
constraints = {'variable': 'tas', 'model': 'MIROC5', 'cmor_table': 'day', 'experiment': 'rcp85'}
results = search(s, project='cmip5', **constraints)
results[0]
{'filenames': ['tas_day_MIROC5_rcp85_r1i1p1_20060101-20091231.nc',
'tas_day_MIROC5_rcp85_r1i1p1_20500101-20591231.nc',
'tas_day_MIROC5_rcp85_r1i1p1_20200101-20291231.nc',
'tas_day_MIROC5_rcp85_r1i1p1_20800101-20891231.nc',
'tas_day_MIROC5_rcp85_r1i1p1_20600101-20691231.nc',
'tas_day_MIROC5_rcp85_r1i1p1_20100101-20191231.nc',
'tas_day_MIROC5_rcp85_r1i1p1_20900101-20991231.nc',
'tas_day_MIROC5_rcp85_r1i1p1_20700101-20791231.nc',
'tas_day_MIROC5_rcp85_r1i1p1_20400101-20491231.nc',
'tas_day_MIROC5_rcp85_r1i1p1_20300101-20391231.nc',
'tas_day_MIROC5_rcp85_r1i1p1_21000101-21001231.nc'],
'project': 'CMIP5',
'institute': 'MIROC',
'model': 'MIROC5',
'experiment': 'rcp85',
'frequency': 'day',
'realm': 'atmos',
'r': '1',
'i': '1',
'p': '1',
'ensemble': 'r1i1p1',
'cmor_table': 'day',
'version': '20120710',
'variable': 'tas',
'pdir': '/g/data1b/al33/replicas/CMIP5/output1/MIROC/MIROC5/rcp85/day/atmos/day/r1i1p1/v20120710/tas',
'periods': [('20060101', '20091231'), ('20500101', '20591231'), ('20200101', '20291231'), ('20800101', '20891231'), ('20600101', '20691231'), ('20100101', '20191231'), ('20900101', '20991231'), ('20700101', '20791231'), ('20400101', '20491231'), ('20300101', '20391231'), ('21000101', '21001231')],
'fdate': '20060101',
'tdate': '21001231',
'time_complete': True}
search returns a list of dictionaries, one for each dataset. The first result shows the dictionary content. The last key, time_complete, is the result of a check run on the time axis built by joining together the file periods: it is True if the time axis is contiguous and False otherwise. NB: this is calculated using only the dates listed in the files; the actual timesteps have not been checked.
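A contiguity check of this kind can be sketched in a few lines. The version below is not clef's implementation: `is_contiguous` is a hypothetical function, and it assumes a standard Gregorian calendar and daily data, whereas real model output may use other calendars (for example 360-day years) and would need different date arithmetic.

```python
from datetime import datetime, timedelta


def is_contiguous(periods):
    """Return True if the sorted (start, end) date pairs join without gaps.

    Assumes a standard Gregorian calendar and daily data (an assumption of
    this sketch, not something the source guarantees).
    """
    spans = sorted(
        (datetime.strptime(start, "%Y%m%d"), datetime.strptime(end, "%Y%m%d"))
        for start, end in periods
    )
    for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
        # Each file must start the day after the previous one ends.
        if next_start - prev_end != timedelta(days=1):
            return False
    return True


# A subset of the periods from the MIROC5 result, deliberately out of order.
periods = [('20060101', '20091231'), ('20500101', '20591231'),
           ('20200101', '20291231'), ('20100101', '20191231'),
           ('20400101', '20491231'), ('20300101', '20391231')]
print(is_contiguous(periods))       # every span from 2006 to 2059 is present
print(is_contiguous(periods[:-1]))  # the 2030-2039 span has been dropped
```

Sorting first matters because, as the example result shows, the periods are not returned in chronological order.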
Both the keys and values of the constraints get checked before being passed to the query function. This means that if you pass a key or a value that does not exist for the chosen project, the function will print a list of valid values and then exit. Let’s re-write the constraints dictionary to show an example:
constraints = {'v': 'tas', 'm': 'MIROC5', 'table': 'day', 'e': 'rcp85', 'activity':'CMIP'}
results = search(s, project='cmip5', **constraints)
Warning activity is not a valid constraint name
Valid constraints are:
dict_values([['source_id', 'model', 'm'], ['realm'], ['time_frequency', 'frequency', 'f'], ['variable_id', 'variable', 'v'], ['experiment_id', 'experiment', 'e'], ['table_id', 'table', 'cmor_table', 't'], ['member_id', 'member', 'ensemble', 'en', 'mi'], ['institution_id', 'institution', 'institute'], ['experiment_family']])
As you can see, the function tells us that ‘activity’ is not a valid constraint for CMIP5; in fact it can only be used with CMIP6. NB: the search accepted all the other abbreviations, since more than one term is allowed for each key. The full list is available from the GitHub repository: https://github.com/coecms/clef/blob/master/clef/data/valid_keys.json
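The alias handling described above can be illustrated with a plain dictionary lookup. This is a sketch under stated assumptions rather than clef's code: `normalise` is a hypothetical function, the alias groups are copied from the CMIP5 warning shown above, treating the first name of each group as the canonical one is an arbitrary choice, and the sketch raises an exception where clef prints the valid names and exits.

```python
# Alias groups copied from the CMIP5 warning printed above.
VALID_CMIP5 = [
    ['source_id', 'model', 'm'], ['realm'],
    ['time_frequency', 'frequency', 'f'], ['variable_id', 'variable', 'v'],
    ['experiment_id', 'experiment', 'e'],
    ['table_id', 'table', 'cmor_table', 't'],
    ['member_id', 'member', 'ensemble', 'en', 'mi'],
    ['institution_id', 'institution', 'institute'], ['experiment_family'],
]


def normalise(constraints, groups=VALID_CMIP5):
    """Map every alias to the first name of its group (an arbitrary choice)."""
    lookup = {alias: group[0] for group in groups for alias in group}
    unknown = [key for key in constraints if key not in lookup]
    if unknown:
        raise KeyError(f"not valid constraint names: {unknown}")
    return {lookup[key]: value for key, value in constraints.items()}


print(normalise({'v': 'tas', 'm': 'MIROC5', 'table': 'day', 'e': 'rcp85'}))
```

Adding `'activity': 'CMIP'` to the dictionary makes the function fail, mirroring the warning in the example above.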
More complex queries
We are adding functions that can facilitate more complex queries; an example is the ‘matching’ function. It is easier to understand how matching works starting from an example. A user might want to get all the model/ensemble combinations which have both tasmin and tasmax. To do this with the standard query, I would have to pass these constraints to a query:
constraints = {'variable': 'tasmin', 'cmor_table': 'day', 'experiment': 'rcp85'}
find all the model/ensemble combinations which have tasmin / rcp85 / day, then repeat the same for ‘tasmax’, and finally check which model/ensemble combinations have both. The ‘matching’ function simplifies all of this. First of all, I can pass multiple values to it:
constraints = {'variable': ['tasmin','tasmax'], 'cmor_table': ['day'], 'experiment': ['rcp85']}
Then I need to define the attribute for which I want all the values to be present:
allvalues=['variable']
I also need to define the attributes whose combination defines a simulation. Here these are model and ensemble, i.e. each model/ensemble combination defines a simulation; in some cases you might also want to add the version:
fixed=['model','ensemble']
Finally we call matching:
results, selection = matching(s, allvalues, fixed, **constraints)
The function returns two lists. The first, ‘results’, contains a dictionary for each simulation that has either tasmin or tasmax for {rcp85, day}. The second, ‘selection’, has only the simulations that have both ‘tasmin’ and ‘tasmax’.
Other examples
Find simulations which have ‘tasmin’ and ‘tasmax’ and both ‘rcp85’ and ‘rcp45’ experiments:
constraints = {'variable': ['tasmin','tasmax'], 'cmor_table': ['day'], 'experiment': ['rcp85', 'rcp45']}
allvalues=['variable', 'experiment']
fixed=['model','ensemble']
results, selection = matching(s, allvalues, fixed, **constraints)
Find simulations which have ‘tasmin’ and ‘tasmax’ for either ‘rcp85’ or ‘rcp45’ experiments:
constraints = {'variable': ['tasmin','tasmax'], 'cmor_table': ['day'], 'experiment': ['rcp85', 'rcp45']}
allvalues=['variable']
fixed=['model','ensemble', 'experiment']
results, selection = matching(s, allvalues, fixed, **constraints)
By default we search CMIP5; to do the same for CMIP6 we need to change the project value and use the right facet names. Find simulations which have ‘tasmin’ and ‘tasmax’ for the ‘piControl’ experiment:
constraints = {'variable_id': ['tasmin','tasmax'], 'table_id': ['day'], 'experiment_id': ['piControl']}
allvalues=['variable_id']
fixed=['source_id','member_id']
results, selection = matching(s, allvalues, fixed, project='CMIP6', **constraints)
In particular for CMIP6, for which data is still being published, you might want to execute the same search on the remote ESGF data catalogue rather than locally. In that case we change the ‘local’ argument from its default value True to False:
constraints = {'variable_id': ['tasmin','tasmax'], 'table_id': ['day'], 'experiment_id': ['piControl']}
allvalues=['variable_id']
fixed=['source_id','member_id']
results, selection = matching(s, allvalues, fixed, project='CMIP6', local=False, **constraints)
NB: currently the abbreviated versions of the constraints won’t work; you will have to use the attributes’ full names.
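The selection logic behind these examples can be pictured as grouping records by the `fixed` attributes and keeping the groups that cover every requested value. The sketch below works on an in-memory list of dictionaries instead of the MAS database. `match_all` is a hypothetical stand-in, not clef's `matching` function. It assumes that listing several attributes in `allvalues` means every combination of their values must be present, and it returns the selection as tuples, which may differ from what clef returns.

```python
from collections import defaultdict
from itertools import product


def match_all(records, allvalues, fixed, **constraints):
    """Group records by the `fixed` attributes and keep the complete groups.

    A pure-Python illustration of the idea, not clef's `matching` code.
    """
    # Keep only the records that satisfy every constraint.
    results = [
        rec for rec in records
        if all(rec[key] in values for key, values in constraints.items())
    ]
    # For each simulation, collect the combinations of `allvalues` it provides.
    seen = defaultdict(set)
    for rec in results:
        sim = tuple(rec[key] for key in fixed)
        seen[sim].add(tuple(rec[key] for key in allvalues))
    # A simulation is complete when it covers every requested combination.
    required = set(product(*(constraints[key] for key in allvalues)))
    selection = [sim for sim, combos in seen.items() if required <= combos]
    return results, selection


# Toy records, invented for this example.
records = [
    {'model': 'A', 'ensemble': 'r1i1p1', 'variable': 'tasmin', 'experiment': 'rcp85'},
    {'model': 'A', 'ensemble': 'r1i1p1', 'variable': 'tasmax', 'experiment': 'rcp85'},
    {'model': 'B', 'ensemble': 'r1i1p1', 'variable': 'tasmin', 'experiment': 'rcp85'},
    {'model': 'B', 'ensemble': 'r1i1p1', 'variable': 'pr', 'experiment': 'rcp85'},
]
results, selection = match_all(
    records, ['variable'], ['model', 'ensemble'],
    variable=['tasmin', 'tasmax'], experiment=['rcp85'],
)
print(len(results), selection)
```

Model B appears in `results` because it has tasmin, but only model A reaches `selection` since it alone has both variables.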
Architecture
clef --missing
- Resolve any constraint wildcards by looking for matches in the local database, e.g.:
SELECT DISTINCT model FROM esgf_dataset WHERE model ILIKE 'ACCESS%';
- Call find_missing_id() with the resolved constraints
- Search ESGF using the constraints, returning the checksum of each matching file
- Match the ESGF checksums against the local metadata database
- Return the ESGF id for any files whose checksums cannot be found in the local database
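The last two steps above amount to a set-membership test on checksums. The following sketch shows that idea with in-memory data; in clef the matching happens against the MAS database. `missing_ids` is a hypothetical function, and the shapes of its inputs (a list of dictionaries with 'id' and 'checksum' keys, and an iterable of checksum strings) are assumptions made for illustration, as are the toy values.

```python
def missing_ids(esgf_files, local_checksums):
    """Return the ids of ESGF files whose checksum is not held locally.

    The input shapes are assumptions made for this sketch.
    """
    local = set(local_checksums)
    return [f['id'] for f in esgf_files if f['checksum'] not in local]


# Toy data, invented for this example.
esgf_files = [
    {'id': 'file-a', 'checksum': 'abc123'},
    {'id': 'file-b', 'checksum': 'def456'},
    {'id': 'file-c', 'checksum': '789fed'},
]
local_checksums = ['abc123', '789fed']
print(missing_ids(esgf_files, local_checksums))
```

This also shows why a mismatch between the local mirror and ESGF can produce false positives: a file counts as missing whenever its exact checksum is absent locally.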
clef --local
- Query the local database for files:
SELECT path
FROM esgf_paths
NATURAL JOIN esgf_metadata_dataset_link
NATURAL JOIN esgf_dataset
WHERE model ILIKE 'ACCESS%'
-- ...
;
- If using the --latest flag, query ESGF using the constraints to retrieve checksums, match these checksums against the local results and return only those found. This is the default behaviour
CleF API
clef.db
Database connection functions
class clef.db.Session
sqlalchemy.orm.session.Session connected to the MAS database
connect() must be called before creating any new sessions
clef.db.connect(url='postgresql://clef.nci.org.au:5432/postgres', user=None, debug=False)
Connect to the MAS database and set up the session
Parameters:
- url – Database URL
- user – Username (password will be prompted via getpass)
- debug – Print debugging information
Returns: sqlalchemy.engine.Engine
clef.model
Model of NCI’s MAS database
The MAS database has two main tables - path and metadata. These base tables are available in the model as Path and Metadata; they have a SQLAlchemy relationship so that the two tables can be joined in queries.
There may be multiple Metadata entries for a single Path; these represent different metadata types, such as checksums, netCDF attributes and POSIX file attributes. The type can be identified from Metadata.type, and is used as a polymorphic identity for SQLAlchemy’s single table inheritance, creating the Checksum, Netcdf and Posix models.
The C5Dataset and C6Dataset models represent datasets like you would find on ESGF, although without a version. They are created in the database from a DISTINCT view of the NetCDF attributes, and can be used to group paths on the filesystem into datasets.
class clef.model.C5Dataset(**kwargs)
A CMIP5-era ESGF dataset
This class only has access to attributes from the file itself, so version information is not present.
See the CMIP documentation for descriptions of the attributes
- cmor_table
- ensemble
- experiment
- institute
- model
- project
- realm
- time_frequency
class clef.model.C6Dataset(**kwargs)
A CMIP6-era ESGF dataset
This class only has access to attributes from the file itself, so version information is not present.
See the CMIP documentation for descriptions of the attributes
- activity_id
- experiment_id
- frequency
- grid_label
- institution_id
- member_id
- nominal_resolution
- project
- realm
- source_id
- source_type
- sub_experiment_id
- table_id
- variable_id
- variant_label
class clef.model.Checksum(**kwargs)
Checksum of a file on Raijin
- md5: md5 checksum
- sha256: sha256 checksum
class clef.model.ExtendedMetadata(**kwargs)
Extra metadata not present in the file’s attributes
class clef.model.Info(**kwargs)
General information about a dataset file
This is a database view, its columns shouldn’t be used for searching as they are large and not indexed.
- contact
- description
- further_info_url
- license
- parent_experiment_id
- source
- title
- tracking_id
- variant_info
class clef.model.Metadata(**kwargs)
Generic base class for Metadata of a file on Raijin
See Posix and Netcdf for specific metadata information
- json: Metadata value
- type: Metadata type
class clef.model.Netcdf(**kwargs)
NetCDF metadata of a file on Raijin
As would be found by ncdump -h
- attributes: File attributes
- dimensions: File dimensions
- variables: File variables
class clef.model.Path(**kwargs)
Path of a file on Raijin, with links to metadata
- path: File path at NCI
clef.esgf
Functions for searching the ESGF and matching the results against the MAS database
- esgf_query() performs a query against the ESGF web API.
- match_query() performs an outer join of the esgf_query() results against the clef.model.Path table
- find_local_path() and find_missing_id() use the results of match_query() to return the files that are replicated locally and missing from the replica respectively.
clef.esgf.esgf_query(query, fields, limit=5000, offset=0, distrib=True, replica=False, latest=None, **kwargs)
Search the ESGF
Searches the ESGF using its API. Keyword arguments not listed here are passed on to the API search; they can be either single values or lists.
Parameters:
- query (str) – Full text query
- fields (list) – Fields to return
- limit (int) – Maximum items to return
- offset (int) – Starting offset of returned items (use with limit for paging)
- distrib (bool) – Distribute the search across all nodes
- replica (bool) – Return replicated datasets
- latest (bool or None) – Return only latest (True), only not latest (False) or all versions (None)
- **kwargs – See the ESGF API docs
Returns: API response from ESGF, decoded from JSON into a Python dict
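The limit and offset parameters are meant to be combined for paging. A generic paging loop might look like the sketch below. It does not call clef or ESGF: `fetch_all` is a hypothetical helper, and `fetch_page` stands in for any callable that accepts `limit` and `offset` and returns a list of at most `limit` items, which is an assumption about how a real search call would be wrapped.

```python
def fetch_all(fetch_page, limit=5000):
    """Collect every item by calling `fetch_page(limit, offset)` repeatedly.

    `fetch_page` is a hypothetical callable returning a list of at most
    `limit` items; it stands in for a real search call.
    """
    items, offset = [], 0
    while True:
        page = fetch_page(limit=limit, offset=offset)
        items.extend(page)
        if len(page) < limit:
            # A short (or empty) page means there is nothing left to fetch.
            return items
        offset += limit


# Fake data source with 12 items, used only to exercise the loop.
DATA = list(range(12))


def fake_page(limit, offset):
    return DATA[offset:offset + limit]


print(len(fetch_all(fake_page, limit=5)))
```

When the total is an exact multiple of the limit the loop makes one extra call that returns an empty page, which is the price of not knowing the total in advance.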
clef.esgf.find_checksum_id(query, **kwargs)
Get checksums and IDs of matching files from ESGF
Searches ESGF using esgf_query(), then converts the response into a SQLAlchemy selectable for further processing
Parameters: **kwargs – See esgf_query()
Returns: Values table of matching File objects, containing:
- checksum
- id
- dataset_id
- title
- version
This table can be joined against the MAS database tables
clef.esgf.find_local_path(session, subq, oformat='file')
Find the filesystem paths of ESGF matches
Converts the results of match_query() to local filesystem paths, either to the file itself or to the containing dataset.
Parameters:
- oformat ('file' or 'dataset') – Return the path to the file or the dataset directory
- subq – result of esgf_query()
Returns: Iterable of strings with the paths to either files or datasets
clef.esgf.find_missing_id(session, subq, oformat='file')
Returns the ESGF id for each file in the ESGF query that doesn’t have a local match
Parameters:
- oformat ('file' or 'dataset') – Return the ESGF id of the file or of the dataset
- subq – result of esgf_query()
Returns: Iterable of strings with the ESGF file or dataset id
clef.esgf.link_to_esgf(query, **kwargs)
Convert search terms to an ESGF search URL
Returns a link to the user-facing ESGF web search matching a particular query. This is helpful for error messages: users can follow the URL to find the matches as ESGF sees them.
Note that this link is to the ESGF user-facing search page, rather than the web API that esgf_query() uses.
Parameters: **kwargs – As esgf_query()
Returns: URL to the ESGF search website
Return type: str
clef.esgf.match_query(session, query, latest=None, **kwargs)
Match ESGF results against clef.model.Path
Matches the results of find_checksum_id() with the Path table. If latest is True the checksums will be matched, otherwise only the file name is used in order to spot outdated versions that have been removed from ESGF.
Parameters:
- latest (bool) – Match the checksums (True) or filenames (False)
- **kwargs – See esgf_query()
Returns: Joined result of clef.model.Path and find_checksum_id()