🏃 ️Usage
This section describes how to use the PDS DUM client script.
pds-ingress-client Description
Upon installation of the DUM application, the pds-ingress-client
script will
be available for use on the command-line.
To see the usage documentation, you can run the following:
$ pds-ingress-client --help
The client application only has two arguments which must provided on each invocation,
at least one path to files or directories to be uploaded, and the name of the PDS
node the submission is on behalf of (via the --node
argument).
When specifying the list of paths to be uploaded, any paths corresponding to directories will be automatically recursed by the client script, and each file within the directory will be included in the set uploaded to PDS Cloud. Any sub-directories will be similarly recursed to find any additional files within, until the entire directory tree is traversed.
The client script provides --include
and --exclude
arguments which can
be used to filter the set of files included in the upload request. Both arguments
support Unix shell-style wildcards (e.g. *.xml
), and can be specified multiple
times to include or exclude multiple patterns. The --include
argument is applied
first, followed by the --exclude
argument. If no --include
arguments are
provided, all files are included by default.
Specifying the node ID of the requestor is accomplished via the --node
argument,
specifying one of the following node name values: atm,eng,geo,img,naif,ppi,rs,rms,sbn
The --dry-run
flag can be used to have pds-ingress-client
determine the
full set of files and directories to be processed without actually submitting
anything for ingest to PDS Cloud. This feature can be useful to ensure the correct
set of files are being included for a request before performing any communication
with the Server side of DUM.
The details of where the client script should submit ingest requests to are configured
within an INI file. After installing DUM, a new INI config should be created with the
appropriate values for each field. Once available, the config for use with a request
can be specified via the --config-path
argument. See the Client Configuration section
of the installation instructions for more details on creating the INI config.
In its default configuration, the pds-ingress-client
script includes the full path
of each discovered file when sending the request to the Server side components. Typically,
this can include unwanted path components, such as a user’s home directory. To control
the path components included, the --prefix
argument can be used to specify a path
prefix that will be trimmed from all file paths discovered by pds-ingress-client
.
For example, if the file /home/user/bundle/file.xml
were to be uploaded, and
--prefix /home/user
were also provided, the path provided to the ingress service
in AWS would resolve to bundle/file.xml
.
The --weblogs LOG_TYPE
argument can be used to indicate that weblog files are being uploaded.
The type of log files (e.g. apache
, nginx
, etc.) should be specified as the
LOG_TYPE
argument. When this flag is provided, the client will automatically adjust
the “trimmed” path for each ingested file to replace the specified path prefix with weblog/
,
ensuring that all weblog files are routed to a specific S3 bucket used for web analytics.
Because of this, the --prefix
argument must be provided when using the --weblogs
argument.
The pds-ingress-client
by default utilizes all available CPUs on the
local machine to perform parallelized ingress requests to the ingress service. The exact
number of threads can be controlled via the --num-threads
argument.
The client script also provides several arguments for reporting status of an ingress request:
--report-path
: Specifies a path to write a detailed report file in JSON format, containing details about which files were uploaded, skipped or failed during transfer.--manifest-path
: Specifies a path to write the manifest of all files included in ingress request, including file checksums. This option may also be used to provide an existing Manifest file to bypass its recomputation during subsequent requests.--log-path
: Specifies a path to write a trace log of the ingress request which does not go to the console. Can be useful for troubleshooting tranfer failures.
Lastly, the --version
option may be used to verify the version number of installed DUM client.
Data Upload Manager Client Workflow
When utilizing the DUM Client script (pds-ingress-client), the following workflow is executed:
Indexing of the requested input files/paths to determine the full input file set
Generation of a Manifest file, containing information, including MD5 checksums, of each file to be ingested
Batch ingress requesting of input file set to the DUM Ingress Service in AWS
Batch upload of input file set to AWS S3
Ingress report creation
Determination of the input file set is determined in Step 1 by resolving the paths providing on the command-line to the DUM client. Any directories provided are recursed to determine the full set of files within. Any paths provided are included as-is into the input file set.
Depending on the size of the input file set, the Manifest file creation in Step 2 can become time-consuming due to the hashing of each file in the input file set. To save time, the –manifest-path command-line option should be leveraged to write the contents of the Manifest to local disk. Specifying the same path via –manifest-path on subsequent executions of the DUM client will result in a read of the existing Manifest from disk. Any files within the input set referenced within the read Manifest will reuse the precomputed values within, saving upfront time prior to start of upload to S3. The Manifest will then be re-written to the path specified by –manifest-path to include any new files encountered. In this way, a Manifest file can expand across executions of DUM to serve as a sort of cache for file information.
The batch size utilized by Steps 3 and 4 can be configured within the INI config provided to the DUM client. The number of batches processed in parallel can be controlled via the –num-threads command-line argument.
By default, at completion of an ingress request (Step 5), the DUM client provides a summary of the results of the transfer:
Ingress Summary Report for 2025-02-25 11:41:29.507022
-----------------------------------------------------
Uploaded: 200 file(s)
Skipped: 0 file(s)
Failed: 0 file(s)
Unprocessed: 0 file(s)
Total: 200 files(s)
Time elapsed: 3019.00 seconds
Bytes transferred: 3087368895
A more detailed JSON-format report, containing full listings of all uploaded/skipped/failed paths, can be written to disk via the –report-path command-line argument:
{
"Arguments": "Namespace(config_path='mcp.test.ingress.config.ini', node='sbn', prefix='/PDS/SBN/', force_overwrite=True, num_threads=4, log_path='/tmp/dum_log.txt', manifest_path='/tmp/dum_manifest.json', report_path='/tmp/dum_report.json', dry_run=False, log_level='info', ingress_paths=['/PDS/SBN/gbo.ast.catalina.survey/'])",
"Batch Size": 3,
"Total Batches": 67,
"Start Time": "2025-02-25 18:51:10.507562+00:00",
"Finish Time": "2025-02-25 19:41:29.504806+00:00",
"Uploaded": [
"gbo.ast.catalina.survey/data_calibrated/703/2020/20Apr02/703_20200402_2B_F48FC1_01_0001.arch.fz",
...
"gbo.ast.catalina.survey/data_calibrated/703/2020/20Apr02/703_20200402_2B_N02055_01_0001.arch.xml"
],
"Total Uploaded": 200,
"Skipped": [],
"Total Skipped": 0,
"Failed": [],
"Total Failed": 0,
"Unprocessed": [],
"Total Unprocessed": 0,
"Bytes Transferred": 3087368895,
"Total Files": 200
}
Lastly, a detailed log file containing trace statements for each file/batch uploaded can be written to disk via the –log-path command-line argument. The log file path may also be specifed within the INI config.
Automatic Retry of Failed Uploads
The DUM client script is configured to automatically retry any failed uploads to S3 using exponential backoff and retry. When an intermittent failure occurs during upload, messages pertaining to the backoff and retry are logged to the log file (which can be specified via the –log-path argument).
Here is an example of such log messages:
...
[2025-09-23 16:21:24,491] INFO Thread-9 (worker) ingress_file_to_s3 : Batch 0 : Ingesting mflat.703.19Dec20.fits.xml to https://pds-staging.s3.amazonaws.com/mflat.703.19Dec20.fits.xml
[2025-09-23 16:21:24,493] WARNING Thread-9 (worker) backoff_handler : Backing off ingress_file_to_s3() for 0.2 seconds after 1 tries, reason: HTTPError
[2025-09-23 16:21:24,665] INFO Thread-9 (worker) ingress_file_to_s3 : Batch 0 : Ingesting mflat.703.19Dec20.fits.xml to https://pds-staging.s3.amazonaws.com/mflat.703.19Dec20.fits.xml
[2025-09-23 16:21:24,667] WARNING Thread-9 (worker) backoff_handler : Backing off ingress_file_to_s3() for 1.2 seconds after 2 tries, reason: HTTPError
[2025-09-23 16:21:25,832] INFO Thread-9 (worker) ingress_file_to_s3 : Batch 0 : Ingesting mflat.703.19Dec20.fits.xml to https://pds-staging.s3.amazonaws.com/mflat.703.19Dec20.fits.xml
[2025-09-23 16:21:25,833] WARNING Thread-9 (worker) backoff_handler : Backing off ingress_file_to_s3() for 1.8 seconds after 3 tries, reason: HTTPError
[2025-09-23 16:21:27,644] INFO Thread-9 (worker) ingress_file_to_s3 : Batch 0 : Ingesting mflat.703.19Dec20.fits.xml to https://pds-staging.s3.amazonaws.com/mflat.703.19Dec20.fits.xml
[2025-09-23 16:21:27,720] INFO Thread-9 (worker) ingress_file_to_s3 : Batch 0 : mflat.703.19Dec20.fits.xml Ingest complete
Typically, log messages pertaining to backoff and retry can be safely ignored if upload is eventually succesful, as in the above example. However, if an upload ultimately fails after all retries are exhausted it could indicate a more serious problem that needs to be investigated:
...
[2025-09-23 16:31:47,231] WARNING Thread-9 (worker) backoff_handler : Backing off ingress_file_to_s3() for 30.9 seconds after 6 tries, reason: HTTPError
[2025-09-23 16:32:18,099] INFO Thread-9 (worker) ingress_file_to_s3 : Batch 0 : Ingesting mflat.703.19Dec20.fits.fz to https://pds-staging.s3.amazonaws.com/mflat.703.19Dec20.fits.fz
[2025-09-23 16:32:18,101] WARNING Thread-9 (worker) backoff_handler : Backing off ingress_file_to_s3() for 23.2 seconds after 7 tries, reason: HTTPError
[2025-09-23 16:32:41,324] INFO Thread-9 (worker) ingress_file_to_s3 : Batch 0 : Ingesting mflat.703.19Dec20.fits.fz to https://pds-staging.s3.amazonaws.com/mflat.703.19Dec20.fits.fz
[2025-09-23 16:32:41,326] WARNING Thread-9 (worker) backoff_handler : Backing off ingress_file_to_s3() for 54.8 seconds after 8 tries, reason: HTTPError
[2025-09-23 16:33:36,086] INFO Thread-9 (worker) ingress_file_to_s3 : Batch 0 : Ingesting mflat.703.19Dec20.fits.fz to https://pds-staging.s3.amazonaws.com/mflat.703.19Dec20.fits.fz
[2025-09-23 16:33:36,087] ERROR Thread-9 (worker) _process_batch : Batch 0 : Ingress failed for mflat.703.19Dec20.fits.fz, Reason: 403 Client Error
Any files that fail to upload after all retries are exhausted are reattempted in one final attempt at the end of DUM client execution:
...
[2025-09-23 16:33:36,094] INFO MainThread main : All batches processed
[2025-09-23 16:33:36,094] INFO MainThread main : ----------------------------------------
[2025-09-23 16:33:36,094] INFO MainThread main : Reattempting ingress for failed files...
[2025-09-23 16:33:36,096] INFO Thread-16 (worker) _prepare_batch_for_ingress : Batch 0 : Preparing for ingress
[2025-09-23 16:33:36,096] INFO Thread-16 (worker) _prepare_batch_for_ingress : Batch 0 : Prep completed in 0.00 seconds
[2025-09-23 16:33:36,108] INFO Thread-23 (worker) request_batch_for_ingress : Batch 0 : Requesting ingress
[2025-09-23 16:33:36,732] INFO Thread-23 (worker) request_batch_for_ingress : Batch 0 : Ingress request completed in 0.62 seconds
[2025-09-23 16:33:36,734] INFO Thread-23 (worker) ingress_file_to_s3 : Batch 0 : Ingesting mflat.703.19Dec20.fits.fz to https://pds-sbn-staging-dev.s3.amazonaws.com/mflat.703.19Dec20.fits.fz
Files that still fail to upload during this final attempt are recorded in the final summary report:
Ingress Summary Report for 2025-09-23 16:35:37.532468
-----------------------------------------------------
Uploaded: 0 file(s)
Skipped: 0 file(s)
Failed: 1 file(s)
Unprocessed: 0 file(s)
Total: 1 files(s)
Time elapsed: 244.87 seconds
Bytes transferred: 0
Should persistent failures like this occur, they should be communicated to the PDS Operations team for investigation.