About Scalable Harvest

Scalable Harvest is a distributed application for crawling a filesystem, extracting PDS4 product metadata, and loading it into the PDS Registry (Elasticsearch). The application is based on the microservices and messaging architecture shown below.

Main Components

  • Message broker - enables services and applications to communicate with each other using messages. Currently we use RabbitMQ, one of the most popular open-source message brokers.
  • Harvest client - submits new jobs to the "job" message queue. The Harvest client's command-line interface and configuration are very similar to those of the existing standalone Harvest.
  • Crawler server - one or more crawler servers process messages from the "job" queue. Each server crawls the directories listed in job messages. Paths of PDS4 label files are combined into batches and published to the "file" queue. The "directory" queue is used to process subfolders.
  • Harvest server - one or more harvest servers process messages from the "file" queue. Each server extracts metadata from the PDS4 labels listed in a message and stores the extracted information in the Registry (Elasticsearch).
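
The flow of messages between the components above can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions, not the actual Scalable Harvest code: Python deques stand in for the RabbitMQ "job", "directory" and "file" queues, a dictionary stands in for the Registry, and the message fields ("dirs", "dir", "files"), the batch size, the ".xml" label filter and all function names are hypothetical.

```python
import os
import tempfile
from collections import deque

BATCH_SIZE = 2  # hypothetical value, not taken from Scalable Harvest


def submit_job(job_queue, root_dir):
    """Harvest client: publish one job message listing directories to crawl."""
    job_queue.append({"dirs": [root_dir]})


def run_crawler(job_queue, directory_queue, file_queue):
    """Crawler server: expand jobs into directories, then into file batches."""
    while job_queue:
        job = job_queue.popleft()
        for path in job["dirs"]:
            directory_queue.append({"dir": path})

    while directory_queue:
        msg = directory_queue.popleft()
        batch = []
        for entry in sorted(os.scandir(msg["dir"]), key=lambda e: e.name):
            if entry.is_dir():
                # Subfolders go back onto the "directory" queue.
                directory_queue.append({"dir": entry.path})
            elif entry.name.endswith(".xml"):
                # Assumption: label files are recognized by an .xml extension.
                batch.append(entry.path)
                if len(batch) == BATCH_SIZE:
                    file_queue.append({"files": batch})
                    batch = []
        if batch:
            # Publish the remaining partial batch for this directory.
            file_queue.append({"files": batch})


def run_harvest(file_queue, registry):
    """Harvest server: 'extract' metadata for each label and store it."""
    while file_queue:
        msg = file_queue.popleft()
        for path in msg["files"]:
            # Placeholder for real PDS4 metadata extraction.
            registry[path] = {"file_name": os.path.basename(path)}


def demo():
    """Build a tiny directory tree and push it through the three stages."""
    job_q, dir_q, file_q = deque(), deque(), deque()
    registry = {}
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "sub"))
        names = ("a.xml", "b.xml", "c.xml", "notes.txt",
                 os.path.join("sub", "d.xml"))
        for rel in names:
            with open(os.path.join(root, rel), "w") as f:
                f.write("<Product_Observational/>")
        submit_job(job_q, root)
        run_crawler(job_q, dir_q, file_q)
        batches = len(file_q)
        run_harvest(file_q, registry)
    return batches, sorted(v["file_name"] for v in registry.values())


if __name__ == "__main__":
    print(demo())
```

In a real deployment each stage would run as a separate process consuming from the broker, so several crawler and harvest servers could share the same queues; the single-process loop here only shows the order in which messages move.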