Load Data - Standalone Harvest
Overview
To load PDS4 data into the Registry, use the Harvest software. Harvest comes in two versions: a simple standalone command-line tool, and a scalable version that can process large data sets in parallel using distributed components that communicate through a message broker. Both versions extract metadata from PDS4 product labels and load it into the OpenSearch database.
This document describes how to load data with the Standalone Harvest command-line tool.
Prerequisites
Standalone Harvest Quick Start
To run Harvest, you need a job configuration file (XML). The file has several sections, such as the Registry (OpenSearch) configuration and the path to the data. Example configuration files are located in <INSTALL_DIR>/conf/examples.
The most useful starting point for a Harvest job is conf/examples/directories.xml. In it, you will want to update the nodeName:
<harvest nodeName="PDS_GEO">
Registry (OpenSearch) configuration:
<registry url="http://localhost:9200" index="registry" auth="/path/to/auth.cfg" />
For details on the registry configuration, see Registry Integration.
Note
Always specify a port in the url attribute. For PDS Registries hosted in AWS, the port should be 443. If no port is specified, it defaults to the OpenSearch default port of 9200, and any attempted writes or updates to the registry will fail.
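As a quick sanity check before running a job, the port requirement in the note can be tested with a few lines of Python. This is only a sketch using the standard library; the function name `has_explicit_port` and the host `registry.example.org` are made up for illustration and are not part of Harvest.

```python
from urllib.parse import urlsplit


def has_explicit_port(url: str) -> bool:
    """Return True if the URL names a port explicitly (for example ':443')."""
    # urlsplit(...).port is None when the URL carries no port component.
    return urlsplit(url).port is not None


# The quick-start value names port 9200 explicitly.
print(has_explicit_port("http://localhost:9200"))
# A hypothetical AWS-hosted registry URL with and without the port.
print(has_explicit_port("https://registry.example.org:443"))
print(has_explicit_port("https://registry.example.org"))
```

The check expects a full URL including the scheme (http:// or https://), as in the registry element above.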
The path to the data:
<directories>
<path>/data/geo/urn-nasa-pds-kaguya_grs_spectra</path>
</directories>
And the URL prefix for the data:
<fileInfo>
<!-- UPDATE with your own local path and base url where pds4 archive is published -->
<fileRef replacePrefix="/data/geo/" with="https://pds-geosciences.wustl.edu/lunar/" />
</fileInfo>
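To make the effect of the fileRef element concrete, the following Python sketch mimics a simple prefix replacement on a file path. It illustrates the idea only and is not Harvest's actual implementation; the function name `apply_file_ref` and the file name `bundle.xml` are hypothetical, and leaving non-matching paths unchanged is an assumption.

```python
def apply_file_ref(path: str, replace_prefix: str, with_url: str) -> str:
    """Swap a local path prefix for a public base URL."""
    if path.startswith(replace_prefix):
        # Keep everything after the prefix and append it to the base URL.
        return with_url + path[len(replace_prefix):]
    # Assumption: paths that do not start with the prefix are left as-is.
    return path


local_path = "/data/geo/urn-nasa-pds-kaguya_grs_spectra/bundle.xml"  # hypothetical file
print(apply_file_ref(local_path, "/data/geo/", "https://pds-geosciences.wustl.edu/lunar/"))
```

With the values from the example configuration, the local prefix /data/geo/ is dropped and the rest of the path is appended to https://pds-geosciences.wustl.edu/lunar/.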
If you save this file as /tmp/kaguya.cfg and run Harvest
harvest -c /tmp/kaguya.cfg
all XML files in the /data/geo/urn-nasa-pds-kaguya_grs_spectra folder and its subfolders will be processed. All metadata from the PDS4 labels will be extracted and loaded into the Registry (OpenSearch).
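To preview which files such a run would look at, a recursive search for XML files can be sketched in Python. This only approximates the behavior described above: Harvest's real selection rules (file extensions, case sensitivity, filters) may differ, and the names `find_xml_files` and `demo` as well as the sample files are invented for the example.

```python
import tempfile
from pathlib import Path


def find_xml_files(root):
    """Return every *.xml file under root, including subfolders, sorted."""
    return sorted(p for p in Path(root).rglob("*.xml") if p.is_file())


def demo():
    """Build a throwaway directory tree and list the XML files found in it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "data_spectra").mkdir()
        (root / "bundle.xml").write_text("<Product_Bundle/>")
        (root / "data_spectra" / "collection.xml").write_text("<Product_Collection/>")
        (root / "data_spectra" / "table.csv").write_text("1,2,3\n")  # not XML, ignored
        return [p.relative_to(root).as_posix() for p in find_xml_files(root)]


print(demo())
```

Pointing `find_xml_files` at the directory named in the directories section gives a rough idea of how many files to expect in the "Processed files" line of the summary.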
You will see multiple log messages similar to these:
...
[INFO] Processing C:\Geo\kaguya_grs_spectra\data_ephemerides\kgrs_ephemerides.xml
[INFO] Processing C:\Geo\kaguya_grs_spectra\data_spectra\kgrs_calibrated_spectra_per1.xml
[INFO] Processing C:\Geo\kaguya_grs_spectra\data_spectra\kgrs_calibrated_spectra_per2.xml
[INFO] Processing C:\Geo\kaguya_grs_spectra\data_spectra\kgrs_calibrated_spectra_per3.xml
[INFO] Processing C:\Geo\kaguya_grs_spectra\data_spectra\spectra_data_collection_inventory.xml
...
[SUMMARY] Summary:
[SUMMARY] Skipped files: 0
[SUMMARY] Processed files: 14
[SUMMARY] File counts by type:
[SUMMARY] Product_Bundle: 1
[SUMMARY] Product_Collection: 4
[SUMMARY] Product_Context: 3
[SUMMARY] Product_Document: 2
[SUMMARY] Product_Observational: 4
[SUMMARY] Package ID: e46f6ba9-6151-48ee-b822-b0536e3e4bd9
To quickly check that the data was loaded, you can query the Registry index in OpenSearch, either by calling the OpenSearch Search API or from a web browser. For example:
# Select all products
curl "http://localhost:9200/registry/_search?q=*&pretty"
# Select only collections
curl "http://localhost:9200/registry/_search?q=product_class:Product_Collection&pretty"
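The same checks can be scripted. The Python sketch below builds the search URLs used in the curl examples and, optionally, fetches the result with the standard library. It assumes an unauthenticated OpenSearch at http://localhost:9200 as in the quick start; a secured registry would need credentials that this sketch does not handle. The helper names `search_url` and `run_search` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def search_url(base_url: str, index: str, query: str) -> str:
    """Build a Search API URL equivalent to the curl examples."""
    # urlencode percent-encodes characters such as ':' and '*'.
    params = urlencode({"q": query, "pretty": "true"})
    return f"{base_url.rstrip('/')}/{index}/_search?{params}"


def run_search(base_url: str, index: str, query: str) -> dict:
    """Send the query and return the parsed JSON response."""
    with urlopen(search_url(base_url, index, query)) as response:
        return json.load(response)


# Select only collections (prints the URL; no request is sent here).
print(search_url("http://localhost:9200", "registry", "product_class:Product_Collection"))
```

Calling `run_search("http://localhost:9200", "registry", "*")` against a running local registry would return the same JSON document that the first curl command prints.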
This page describes the job configuration file in detail.
Next Steps
When the data is ready for public release, you need to update the archive status.