Detailed Harvest Configuration¶
The following sections describe the Harvest configuration file in more detail.
Node Name¶
Node name is a required parameter. It tags ingested data with the node that ingested it.
<harvest nodeName="PDS_SBN">
...
One of the following values can be used:
PDS_ATM - Planetary Data System: Atmospheres Node
PDS_ENG - Planetary Data System: Engineering Node
PDS_GEO - Planetary Data System: Geosciences Node
PDS_IMG - Planetary Data System: Imaging Node
PDS_NAIF - Planetary Data System: NAIF Node
PDS_RMS - Planetary Data System: Rings Node
PDS_SBN - Planetary Data System: Small Bodies Node at University of Maryland
PSA - Planetary Science Archive
JAXA - Japan Aerospace Exploration Agency
ROSCOSMOS - Russian State Corporation for Space Activities
This value is saved in the “ops:Harvest_Info/ops:node_name” field of the loaded OpenSearch documents:
{
...
"ops:Harvest_Info/ops:node_name": "PDS_SBN",
...
}
Input Directories and Filters¶
Process Directories¶
To process products from one or more directories, add the following section to the Harvest configuration file:
<harvest nodeName="PDS_SBN">
...
<directories>
<path>/some-directory/sub-dir-1/</path>
<path>/some-directory/sub-dir-2/</path>
</directories>
...
</harvest>
Note
You cannot have both <directories> and <bundles> sections at the same time.
Process a List of Files¶
First, create a manifest file that lists all the files you want to process, one file path per line.
/data/d1/CCF_0088_0674757853_190FDR_N0040048CACH00100_0A10LLJ05.xml
/data/d1/CCF_0088_0674757853_190FDR_N0040048CACH00100_0A10LLJ07.xml
/data/d1/CCF_0088_0674757853_190FDR_N0040048CACH00100_0A10LLJ09.xml
Next, add the following section to the Harvest configuration file:
<harvest nodeName="PDS_SBN">
...
<files>
<manifest>/some-directory/manifest.txt</manifest>
</files>
...
</harvest>
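For a large directory tree, writing the manifest by hand is tedious. The following is a minimal Python sketch, not part of Harvest, that writes one absolute path per line for every file matching a pattern. The function name write_manifest and the default "*.xml" pattern are assumptions made for this example.

```python
from pathlib import Path


def write_manifest(root, manifest_path, pattern="*.xml"):
    """Write one absolute file path per line for every match under root.

    Hypothetical helper for illustration; it is not part of Harvest.
    Returns the list of paths that were written.
    """
    # Walk the tree recursively and keep regular files only.
    paths = sorted(
        str(p.resolve()) for p in Path(root).rglob(pattern) if p.is_file()
    )
    # One path per line, with a trailing newline when the list is not empty.
    text = "\n".join(paths) + ("\n" if paths else "")
    Path(manifest_path).write_text(text, encoding="utf-8")
    return paths
```

For example, write_manifest("/data/d1", "/some-directory/manifest.txt") would list every XML file under /data/d1, assuming both locations exist and are writable.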
Filtering Products by Class¶
You can include or exclude products of a particular class. For example, to process only documents, add the following product filter to the Harvest configuration file:
<harvest nodeName="PDS_SBN">
...
<productFilter>
<includeClass>Product_Document</includeClass>
</productFilter>
...
</harvest>
To exclude documents, add the following product filter:
<harvest nodeName="PDS_SBN">
...
<productFilter>
<excludeClass>Product_Document</excludeClass>
</productFilter>
...
</harvest>
Note
You cannot have both include and exclude filters at the same time.
Process Bundles¶
(only applies to command line harvest)
To process products from one or more bundles, add the following section to the Harvest configuration file:
<harvest nodeName="PDS_SBN">
...
<bundles>
<bundle dir="/data/geo/urn-nasa-pds-kaguya_grs_spectra" />
<bundle dir="/data/geo/urn-nasa-pds-trang2020_moon_space_weathering" />
</bundles>
...
</harvest>
Note
You cannot have both <directories> and <bundles> sections at the same time.
Filtering Bundle Versions¶
(only applies to command line harvest)
Use the “versions” attribute of the <bundle> tag to list the bundle versions to process. You can separate versions with commas, semicolons, or spaces.
<harvest nodeName="PDS_SBN">
...
<bundles>
<bundle dir="/data/OREX/orex_spice" versions="7.0;8.0" />
</bundles>
...
</harvest>
To process all versions, either set versions="all" or omit the versions attribute.
<harvest nodeName="PDS_SBN">
...
<bundles>
<bundle dir="/data/OREX/orex_spice" versions="all" />
</bundles>
...
</harvest>
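The separator rules for the versions attribute can be pictured with a short Python sketch. This is an illustration of the rules described above, not Harvest's own parser; the function name parse_versions and the use of None to mean "all versions" are assumptions.

```python
import re


def parse_versions(attr):
    """Split a versions attribute on commas, semicolons, or whitespace.

    Returns None when every version should be processed, which is
    assumed here for a missing attribute or the literal value "all".
    """
    if attr is None:
        return None
    tokens = [t for t in re.split(r"[,;\s]+", attr.strip()) if t]
    if not tokens or tokens == ["all"]:
        return None
    return tokens
```

With this sketch, "7.0;8.0", "7.0,8.0", and "7.0 8.0" all yield the same two versions.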
Filtering Bundle’s Collections¶
(only applies to command line harvest)
By default, Harvest processes all collections listed in the <Bundle_Member_Entry> sections of a bundle. To process a subset of collections, provide a list of LIDs or LIDVIDs as shown below.
<!-- Filter by collection LID -->
<bundle dir="/data/OREX/orex_spice" versions="8.0" >
<collection lid="urn:nasa:pds:orex.spice:spice_kernels" />
</bundle>
<!-- Filter by collection LIDVID -->
<bundle dir="/data/OREX/orex_spice" versions="8.0;7.0" >
<collection lidvid="urn:nasa:pds:orex.spice:spice_kernels::8.0" />
<collection lidvid="urn:nasa:pds:orex.spice:spice_kernels::7.0" />
</bundle>
Filtering Bundle’s Product Directories¶
(only applies to command line harvest)
By default, Harvest processes all products listed in the collection inventory file. To process a subset of products, provide a list of directories.
<bundle dir="/data/OREX/orex_spice" versions="8.0" >
<!-- Specify a substring in a relative (to the bundle root) directory name. -->
<product dir="/fk/" />
</bundle>
File Reference / Access URL¶
Harvest extracts absolute paths of product and label files, such as
"ops:Label_File_Info/ops:file_ref":"/tmp/d5/naif0012.xml",
"ops:Data_File_Info/ops:file_ref":"/tmp/d5/naif0012.tls",
Note that on Windows, backslashes are replaced with forward slashes and the drive letter is included.
"ops:Label_File_Info/ops:file_ref":"C:/tmp/d4/bundle_orex_spice_v009.xml",
To replace a file path prefix with another value, such as a URL, add a <fileRef/> tag to the Harvest configuration file:
<fileInfo>
<fileRef replacePrefix="/C:/tmp/d4/"
with="https://naif.jpl.nasa.gov/pub/naif/pds/pds4/orex/orex_spice/" />
</fileInfo>
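The effect of the fileRef element can be pictured as a plain string prefix substitution. The Python sketch below assumes exactly that behavior, which this document does not spell out in detail. The function name replace_prefix is hypothetical, and the sample path is chosen to match the replacePrefix value shown above.

```python
def replace_prefix(file_ref, prefix, replacement):
    """Swap a leading path prefix for another value, such as a base URL.

    Illustrative only: assumes a simple, case-sensitive prefix match.
    Paths that do not start with the prefix are returned unchanged.
    """
    if file_ref.startswith(prefix):
        return replacement + file_ref[len(prefix):]
    return file_ref


# Sample values taken from the configuration fragment above.
url = replace_prefix(
    "/C:/tmp/d4/bundle_orex_spice_v009.xml",
    "/C:/tmp/d4/",
    "https://naif.jpl.nasa.gov/pub/naif/pds/pds4/orex/orex_spice/",
)
```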
After running Harvest, you should get a different file_ref value:
"ops:Label_File_Info/ops:file_ref":
"https://naif.jpl.nasa.gov/pub/naif/pds/pds4/orex/orex_spice/bundle_orex_spice_v009.xml"
Registry Integration¶
(only applies to command line harvest)
The standalone Harvest tool loads extracted PDS4 metadata into an OpenSearch database. You have to configure the following OpenSearch parameters:
url - Registry (OpenSearch) URL
index - OpenSearch index name. This is an optional parameter. Default value is ‘registry’.
auth - Registry (OpenSearch) authentication configuration file. This is an optional parameter.
Below are a few examples:
Local OpenSearch instance (localhost)
<harvest nodeName="PDS_SBN">
...
<registry url="http://localhost:9200" index="registry" />
...
</harvest>
Note
Always specify a port in the url attribute. For PDS Registries in AWS, the port should be 443. If no port is specified, it defaults to the OpenSearch default port of 9200, and any attempted writes or updates to the registry will fail.
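Because a missing port is easy to overlook, it can help to check the registry URL before running Harvest. The following Python sketch uses the standard urllib.parse module to test whether a URL carries an explicit port. The function name has_explicit_port is an assumption for this example, and the check is not something Harvest performs itself as far as this document states.

```python
from urllib.parse import urlsplit


def has_explicit_port(url):
    """Return True when the URL names a port, for example ':443'."""
    # urlsplit(...).port is None when the URL omits the port.
    return urlsplit(url).port is not None
```

For the examples in this section, "http://localhost:9200" and "https://es-server.mydomain.com:443" pass the check, while "https://es-server.mydomain.com" does not.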
Remote OpenSearch instance (on-prem or cloud)
<harvest nodeName="PDS_SBN">
...
<registry url="https://es-server.mydomain.com:443" index="registry" auth="/path/to/auth.cfg" />
...
</harvest>
If your OpenSearch server requires authentication, you have to create an authentication configuration file and provide the following parameters:
# true - trust self-signed certificates; false - don't trust.
trust.self-signed = true
user = pds-user1
password = mypassword
Label and Data File Information¶
(only applies to command line harvest)
By default, Harvest extracts label and data file information, such as file name, mime type, size, and MD5 hash.
Label:
"ops:Label_File_Info/ops:creation_date_time":"2020-11-18T22:25:05Z",
"ops:Label_File_Info/ops:file_name":"naif0012.xml",
"ops:Label_File_Info/ops:file_ref":"/C:/tmp/d5/naif0012.xml",
"ops:Label_File_Info/ops:file_size":"3398",
"ops:Label_File_Info/ops:md5_checksum":"69ea2974a93854d90399b8b8fc3d1334"
Data file:
"ops:Data_File_Info/ops:creation_date_time":"2020-11-18T22:25:17Z",
"ops:Data_File_Info/ops:file_name":"naif0012.tls",
"ops:Data_File_Info/ops:file_ref":"/C:/tmp/d5/naif0012.tls",
"ops:Data_File_Info/ops:file_size":"5257",
"ops:Data_File_Info/ops:md5_checksum":"25a2fff30b0dedb4d76c06727b1895b1",
"ops:Data_File_Info/ops:mime_type":"text/plain",
If you don’t want to process data files, add the following flag to the Harvest configuration file.
<fileInfo processDataFiles="false" />
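To cross-check the file name, size, and MD5 values that Harvest records, you can compute them independently. The Python sketch below is one way to do that with the standard library. The function name file_info and the dictionary keys are chosen for this example and only mirror the last part of the field names shown above.

```python
import hashlib
from pathlib import Path


def file_info(path, chunk_size=1 << 20):
    """Return name, size in bytes, and MD5 hex digest of a file.

    Hypothetical helper for verification; not part of Harvest.
    """
    p = Path(path)
    digest = hashlib.md5()
    with p.open("rb") as fh:
        # Read in chunks so large data files do not have to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "file_name": p.name,
        "file_size": p.stat().st_size,
        "md5_checksum": digest.hexdigest(),
    }
```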
BLOB Storage¶
(only applies to command line harvest)
By default, Harvest stores PDS product labels as BLOBs (Binary Large OBjects). Both the original product labels in XML format and the labels converted to JSON are stored. The data is compressed and stored in the following fields: “ops/Label_File_Info/ops/blob” and “ops/Label_File_Info/ops/json_blob”.
You can expect compression ratios of up to about 9:1 for some files. For example, many LADEE housekeeping labels are about 45 KB, and the compressed BLOB is about 5 KB. For smaller files, such as collection labels, the ratio is about 3.5:1 (a 5.5 KB file is compressed to 1.6 KB).
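As a rough check of the figures quoted above, 45 KB to about 5 KB is a ratio of 9 to 1, and 5.5 KB to 1.6 KB is about 3.4 to 1. The Python sketch below does that arithmetic and then compresses a made-up, repetitive XML string with the standard zlib module to show why labels tend to compress well. zlib and the sample text are assumptions for illustration; this document does not say which compression algorithm Harvest uses.

```python
import zlib


def compression_ratio(original_size, compressed_size):
    """Original size divided by compressed size (9.0 means 9:1)."""
    return original_size / compressed_size


# Figures quoted in the text.
large_label_ratio = compression_ratio(45, 5)
small_label_ratio = compression_ratio(5.5, 1.6)

# Made-up sample: XML labels repeat the same tags many times.
sample = (
    "<Field_Character><name>value</name>"
    "<data_type>ASCII_Real</data_type></Field_Character>\n" * 200
).encode("utf-8")
packed = zlib.compress(sample)
sample_ratio = compression_ratio(len(sample), len(packed))
```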
After loading data into OpenSearch, you can extract the original labels by running the Registry Manager tool:
registry-manager export-file \
-lidvid urn:nasa:pds:ladee_ldex:data_calibrated::1.2 \
-file /tmp/data_calibrated.xml
To disable BLOB storage, modify the fileInfo section in the Harvest configuration file.
<fileInfo storeLabels="false" storeJsonLabels="false" />
Extract Metadata by XPath¶
(only applies to command line harvest)
To extract metadata by XPath, you have to create one or more mapping files and list them in the Harvest configuration file as shown below.
<harvest nodeName="PDS_SBN">
...
<xpathMaps baseDir="/home/pds/harvest/conf">
<xpathMap filePath="common.xml" />
<xpathMap rootElement="Product_Observational" filePath="observational.xml" />
</xpathMaps>
</harvest>
In the example above there are two xpathMap entries. Each entry must have a filePath attribute pointing to a mapping file. The path can be either absolute or relative to the baseDir attribute of the xpathMaps tag. The baseDir attribute is optional. The same example with absolute paths is shown below.
<xpathMaps>
<xpathMap filePath="/home/pds/harvest/conf/common.xml" />
<xpathMap rootElement="Product_Observational"
filePath="/home/pds/harvest/conf/observational.xml" />
</xpathMaps>
An xpathMap entry can have an optional rootElement attribute. Without this attribute, the XPath queries defined in the mapping file (common.xml in the example above) run against every XML document processed by Harvest. With the rootElement attribute, the queries run only against XML documents with that root element.
Mapping Files¶
A mapping file has one or more entries which map an output field name to an XPath query. For example, to extract start_date_time and stop_date_time from observational products, you can use the following file.
<?xml version="1.0" encoding="UTF-8"?>
<xpaths>
<xpath fieldName="start_date_time">/Product_Observational/Observation_Area/Time_Coordinates/start_date_time</xpath>
<xpath fieldName="stop_date_time">/Product_Observational/Observation_Area/Time_Coordinates/stop_date_time</xpath>
</xpaths>
You can use the optional dataType="date" attribute to convert valid PDS dates to the ISO-8601 “instant” format (e.g., “2013-10-24T00:49:37.457Z”).
<xpaths>
<xpath fieldName="start_date_time"
dataType="date">/Product_Observational/Observation_Area/Time_Coordinates/start_date_time</xpath>
<xpath fieldName="stop_date_time"
dataType="date">/Product_Observational/Observation_Area/Time_Coordinates/stop_date_time</xpath>
</xpaths>
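To give a feel for what the date conversion produces, the Python sketch below turns a few common date-time spellings into the “instant” form. It is an approximation written for this document, not Harvest's converter. The function name to_instant, the four accepted patterns (calendar or day-of-year dates, with or without fractional seconds), and the assumption that all values are UTC are choices made for this example.

```python
from datetime import datetime, timezone

# Patterns handled by this sketch; real PDS labels may use others.
_FORMATS = (
    "%Y-%m-%dT%H:%M:%S.%f",
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%jT%H:%M:%S.%f",
    "%Y-%jT%H:%M:%S",
)


def to_instant(value):
    """Convert a date-time string to 'YYYY-MM-DDTHH:MM:SS.sssZ'."""
    text = value.strip()
    if text.endswith("Z"):
        text = text[:-1]
    for fmt in _FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        # Assume UTC and keep millisecond precision.
        parsed = parsed.replace(tzinfo=timezone.utc)
        stamp = parsed.isoformat(timespec="milliseconds")
        return stamp.replace("+00:00", "Z")
    raise ValueError("unsupported date-time value: " + repr(value))
```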
XML Name Spaces¶
Harvest ignores namespaces when extracting metadata by XPath. Below is a fragment of a LADEE UVS product label, which uses the “ladee” namespace for mission area fields.
<Observation_Area>
<Mission_Area>
<ladee:latitude>17.2367925372247</ladee:latitude>
<ladee:longitude>194.054477731391</ladee:longitude>
...
To extract latitude and longitude, you can use the following XPaths without namespaces.
<xpaths>
<xpath fieldName="latitude">//Mission_Area/latitude</xpath>
<xpath fieldName="longitude">//Mission_Area/longitude</xpath>
</xpaths>
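You can try out namespace-free XPaths before adding them to a mapping file. The Python sketch below mimics the behavior described above by stripping namespaces from a parsed label and then running the queries with the standard xml.etree.ElementTree module. It is not how Harvest is implemented. The function names, the sample label, and the namespace URIs in it are assumptions for this example, and ElementTree supports only a subset of XPath, so a leading "//" is rewritten to ".//".

```python
import xml.etree.ElementTree as ET

# Made-up label fragment; the namespace URIs are placeholders.
SAMPLE_LABEL = """\
<Product_Observational xmlns="http://example.org/pds"
                       xmlns:ladee="http://example.org/ladee">
  <Observation_Area>
    <Mission_Area>
      <ladee:latitude>17.2367925372247</ladee:latitude>
      <ladee:longitude>194.054477731391</ladee:longitude>
    </Mission_Area>
  </Observation_Area>
</Product_Observational>
"""


def strip_namespaces(root):
    """Remove the '{uri}' prefix ElementTree adds to namespaced tags."""
    for element in root.iter():
        if isinstance(element.tag, str) and "}" in element.tag:
            element.tag = element.tag.split("}", 1)[1]
    return root


def extract(label_xml, xpaths):
    """Map each field name to the list of text values its XPath finds."""
    root = strip_namespaces(ET.fromstring(label_xml))
    result = {}
    for field_name, xpath in xpaths.items():
        # ElementTree needs a relative path: "//x" becomes ".//x".
        query = "." + xpath if xpath.startswith("//") else xpath
        result[field_name] = [el.text for el in root.findall(query)]
    return result


values = extract(
    SAMPLE_LABEL,
    {
        "latitude": "//Mission_Area/latitude",
        "longitude": "//Mission_Area/longitude",
    },
)
```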