🏃 Usage

This section describes how to use the PDS DUM client script.

pds-ingress-client Description

Upon installation of the DUM application, the pds-ingress-client script will be available for use on the command-line.

To see the usage documentation, you can run the following:

$ pds-ingress-client --help

Command-Line Arguments Reference

Client-side script used to submit ingress requests to the DUM service in AWS.

usage: pds-ingress-client [-h] [-c CONFIG_PATH]
                          -n {atm,eng,geo,img,naif,ppi,rs,rms,sbn}
                          [--prefix PREFIX] [--weblogs LOG_TYPE]
                          [--force-overwrite] [--include PATTERN]
                          [--exclude PATTERN] [--num-threads NUM_THREADS]
                          [--log-path LOG_PATH]
                          [--manifest-path MANIFEST_PATH]
                          [--report-path REPORT_PATH] [--dry-run]
                          [--skip-symlinks]
                          [--log-level {warn,warning,info,debug}] [--version]
                          file_or_dir [file_or_dir ...]

Positional Arguments

file_or_dir

One or more paths to the files to ingest to S3. For each directory path provided, this script will automatically derive all sub-paths for inclusion with the ingress request.

Named Arguments

-c, --config-path

Path to the INI config for use with this client. If not provided, the default config (/github/workspace/src/pds/ingress/conf.default.ini) is used.

-n, --node

Possible choices: atm, eng, geo, img, naif, ppi, rs, rms, sbn

PDS node identifier of the ingress requestor. This value is used by the Ingress service to derive the S3 upload location. Argument is case-insensitive.

--prefix, -p

Specify a path prefix to be trimmed from each resolved ingest path such that it is not included with the request to the Ingress Service. For example, specifying --prefix “/home/user” would modify paths such as “/home/user/bundle/file.xml” to just “bundle/file.xml”. This can be useful for controlling which parts of a directory structure should be included with the S3 upload location returned by the Ingress Service.

--weblogs, -w

Denotes the upload request as being for LOG_TYPE web logs. All uploaded files will be routed to a special S3 location reserved for web log files. LOG_TYPE denotes the type of web logs being uploaded, and becomes part of the destination upload path. If provided, --prefix must be provided as well.

--force-overwrite, -f

By default, the DUM service determines if a given file has already been ingested to the PDS Cloud and has not changed. If so, ingress of the file is skipped. Use this flag to override this behavior and forcefully overwrite any existing versions of files within the PDS Cloud.

Default: False

--include, -i

Specify a file path pattern to match against when determining which files should be included with an Ingress request. Unix-style wildcard patterns are supported. Include patterns are always applied prior to any Exclude patterns. This argument can be specified multiple times to configure multiple include patterns. Include patterns are evaluated in the order they are provided.

Default: []

--exclude, -e

Specify a file path pattern to match against when determining which files should be excluded from an Ingress request. Unix-style wildcard patterns are supported. Exclude patterns are always applied after any Include patterns. This argument can be specified multiple times to configure multiple exclude patterns. Exclude patterns are evaluated in the order they are provided.

Default: []

--num-threads, -t

Specify the number of threads to use when uploading files to S3 in parallel. By default, all available cores are used.

Default: 4

--log-path

Specify a file path to write logging statements to. These will include some of the messages logged to the console, as well as additional messages about the status of each file/batch transfer. By default, the log file is created in a temporary location if this parameter is not provided. If provided, this argument takes precedence over what is provided for OTHER.log_file_path in the INI config.

--manifest-path

Specify a file path to a JSON manifest of all files indexed for inclusion in the current ingress request. If the provided path is not an existing file, then the manifest will be written to that location. If the path already exists, this script will read the manifest, and skip checksum generation for any paths that are already specified. If not provided, no manifest is written or read.

--report-path, -r

Specify a path to write a JSON summary report containing the full listing of all files ingressed, skipped or failed. By default, no report is created.

--dry-run

Derive the full set of ingress paths without performing any submission requests to the server.

Default: False

--skip-symlinks

Do not follow symbolic links when resolving ingress paths. Use this option to avoid uploading duplicate data when files are symlinked into multiple locations.

Default: False

--log-level, -l

Possible choices: warn, warning, info, debug

Sets the Logging level for logged messages. If not provided, the logging level set in the INI config is used instead.

--version

Print the Data Upload Manager release version and exit.

Usage Notes

The client application has only two arguments which must be provided on each invocation: at least one path to the files or directories to be uploaded, and the name of the PDS node the submission is on behalf of (via the --node argument).

When specifying the list of paths to be uploaded, any paths corresponding to directories will be automatically recursed by the client script, and each file within the directory will be included in the set uploaded to PDS Cloud. Any sub-directories will be similarly recursed to find any additional files within, until the entire directory tree is traversed. By default, symbolic links are followed during path resolution, meaning that symlinked files and directories will be included in the upload set. To prevent uploading duplicate data when files are symlinked into multiple locations within a data delivery structure, use the --skip-symlinks flag to skip symbolic links during traversal.
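The traversal described above can be pictured with a short Python sketch. This is not the DUM client's implementation; the function name resolve_paths and its skip_symlinks parameter are hypothetical and only mirror the behavior of the --skip-symlinks flag as documented here.

```python
import os


def resolve_paths(paths, skip_symlinks=False):
    """Expand files and directories into a flat list of file paths.

    Illustrative only: the name and signature are assumptions, not DUM's API.
    """
    resolved = []
    for path in paths:
        if skip_symlinks and os.path.islink(path):
            continue
        if os.path.isdir(path):
            # followlinks=True descends into symlinked directories; note that
            # os.walk can loop forever if such a link points at a parent.
            for root, dirs, files in os.walk(path, followlinks=not skip_symlinks):
                dirs.sort()  # deterministic traversal order
                for name in sorted(files):
                    full_path = os.path.join(root, name)
                    if skip_symlinks and os.path.islink(full_path):
                        continue
                    resolved.append(full_path)
        elif os.path.isfile(path):
            # Plain file paths are included as-is.
            resolved.append(path)
    return resolved
```

Calling resolve_paths(["/home/user/bundle"], skip_symlinks=True) would return every regular file under the directory while leaving out symlinked files and directories.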

The client script provides --include and --exclude arguments which can be used to filter the set of files included in the upload request. Both arguments support Unix shell-style wildcards (e.g. *.xml), and can be specified multiple times to include or exclude multiple patterns. The --include argument is applied first, followed by the --exclude argument. If no --include arguments are provided, all files are included by default.
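The include-then-exclude ordering can be sketched in Python with the standard fnmatch module. This is a minimal illustration rather than the client's actual code: the function name filter_paths is hypothetical, and matching each pattern against the full path (rather than only the file name) is an assumption.

```python
from fnmatch import fnmatchcase


def filter_paths(paths, includes=(), excludes=()):
    """Apply include patterns first, then exclude patterns (hypothetical helper)."""
    if includes:
        kept = [p for p in paths if any(fnmatchcase(p, pat) for pat in includes)]
    else:
        # With no include patterns, every path is kept by default.
        kept = list(paths)
    # Exclude patterns are applied to whatever survived the include pass.
    return [p for p in kept if not any(fnmatchcase(p, pat) for pat in excludes)]


paths = ["bundle/file.xml", "bundle/file.tab", "bundle/notes.txt"]
print(filter_paths(paths, includes=["*.xml", "*.tab"], excludes=["*file.tab"]))
# ['bundle/file.xml']
```

Because the exclude pass runs last, a file that matches both an include and an exclude pattern is left out of the request.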

Specifying the node ID of the requestor is accomplished via the --node argument, using one of the following node name values: atm, eng, geo, img, naif, ppi, rs, rms, sbn.

The --dry-run flag can be used to have pds-ingress-client determine the full set of files and directories to be processed without actually submitting anything for ingest to PDS Cloud. This feature can be useful to ensure the correct set of files are being included for a request before performing any communication with the Server side of DUM.

The details of where the client script should submit ingest requests to are configured within an INI file. After installing DUM, a new INI config should be created with the appropriate values for each field. Once available, the config for use with a request can be specified via the --config-path argument. See the Client Configuration section of the installation instructions for more details on creating the INI config.

In its default configuration, the pds-ingress-client script includes the full path of each discovered file when sending the request to the Server side components. Typically, this can include unwanted path components, such as a user’s home directory. To control the path components included, the --prefix argument can be used to specify a path prefix that will be trimmed from all file paths discovered by pds-ingress-client.

For example, if the file /home/user/bundle/file.xml were to be uploaded, and --prefix /home/user were also provided, the path provided to the ingress service in AWS would resolve to bundle/file.xml.
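The effect of --prefix on a path can be reproduced with a few lines of Python. This is only a sketch of the trimming rule described above; the helper name trim_prefix is hypothetical, and leaving a path unchanged when it does not start with the prefix is an assumption.

```python
from pathlib import PurePosixPath
from typing import Optional


def trim_prefix(path: str, prefix: Optional[str]) -> str:
    """Return path with the leading prefix removed (hypothetical helper)."""
    if not prefix:
        return path
    try:
        return str(PurePosixPath(path).relative_to(PurePosixPath(prefix)))
    except ValueError:
        # Assumption: a path outside the prefix is passed through untouched.
        return path


print(trim_prefix("/home/user/bundle/file.xml", "/home/user"))
# bundle/file.xml
```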

The --weblogs LOG_TYPE argument can be used to indicate that weblog files are being uploaded. The type of log files (e.g. apache, nginx, etc.) should be specified as the LOG_TYPE argument. When this flag is provided, the client will automatically adjust the “trimmed” path for each ingested file to replace the specified path prefix with weblog/, ensuring that all weblog files are routed to a specific S3 bucket used for web analytics. Because of this, the --prefix argument must be provided when using the --weblogs argument.

Warning

All weblog files must be gzip-compressed (.gz extension). When using --weblogs, the client validates that every file in the request has a .gz extension before any uploads begin. If even a single non-gzipped file is present in the input file set, the entire request will fail immediately with no files being uploaded.

For example, if you attempt to upload a directory containing both access.log.gz and access.log.txt, the upload will abort with an error listing the non-gzipped files. To successfully upload, either:

  1. Compress all files with gzip before uploading, or

  2. Use the --exclude argument to filter out non-gzipped files (e.g., --exclude "*.txt")
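Before submitting a weblog request, the same check can be run locally to see which files would cause it to be rejected. The sketch below is an assumption-laden illustration: the function name find_non_gzipped is hypothetical, and a case-sensitive comparison against the .gz extension is assumed.

```python
def find_non_gzipped(paths):
    """Return the paths that do not end in .gz (hypothetical pre-flight helper)."""
    # Assumption: the extension check is a simple case-sensitive suffix test.
    return [p for p in paths if not p.endswith(".gz")]


offenders = find_non_gzipped(["logs/access.log.gz", "logs/access.log.txt"])
print(offenders)
# ['logs/access.log.txt']
```

An empty result suggests the file set would pass the gzip validation; any listed file needs to be compressed or excluded first.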

The pds-ingress-client by default utilizes all available CPUs on the local machine to perform parallelized ingress requests to the ingress service. The exact number of threads can be controlled via the --num-threads argument.

The client script also provides several arguments for reporting status of an ingress request:

  • --report-path : Specifies a path to write a detailed report file in JSON format, containing details about which files were uploaded, skipped or failed during transfer.

  • --manifest-path : Specifies a path to write the manifest of all files included in the ingress request, including file checksums. This option may also be used to provide an existing Manifest file to bypass its recomputation during subsequent requests.

  • --log-path : Specifies a path to write a trace log of the ingress request which does not go to the console. Can be useful for troubleshooting transfer failures.

Lastly, the --version option may be used to verify the version number of the installed DUM client.

Data Upload Manager Client Workflow

When utilizing the DUM Client script (pds-ingress-client), the following workflow is executed:

  1. Indexing of the requested input files/paths to determine the full input file set

  2. Generation of a Manifest file, containing information, including MD5 checksums, of each file to be ingested

  3. Batch ingress requesting of input file set to the DUM Ingress Service in AWS

  4. Batch upload of input file set to AWS S3

  5. Ingress report creation

The input file set is determined in Step 1 by resolving the paths provided on the command line to the DUM client. Any directories provided are recursed to determine the full set of files within. Any file paths provided are included as-is in the input file set.

Depending on the size of the input file set, the Manifest file creation in Step 2 can become time-consuming due to the hashing of each file in the input file set. To save time, the --manifest-path command-line option should be leveraged to write the contents of the Manifest to local disk. Specifying the same path via --manifest-path on subsequent executions of the DUM client will result in a read of the existing Manifest from disk. Any files within the input set referenced within the read Manifest will reuse the precomputed values within, saving upfront time prior to start of upload to S3. The Manifest will then be re-written to the path specified by --manifest-path to include any new files encountered. In this way, a Manifest file can expand across executions of DUM to serve as a sort of cache for file information.
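The caching idea behind --manifest-path can be sketched as follows. This is not DUM's implementation: the function names, the manifest layout (a mapping from path to an "md5" and "size" entry), and the chunk size are all assumptions made for illustration.

```python
import hashlib
import json
import os


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as infile:
        for chunk in iter(lambda: infile.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths, manifest_path=None):
    """Build a manifest, reusing entries from an existing one (hypothetical layout)."""
    manifest = {}
    if manifest_path and os.path.isfile(manifest_path):
        with open(manifest_path) as infile:
            manifest = json.load(infile)
    for path in paths:
        if path not in manifest:
            # Only files missing from the cached manifest are hashed.
            manifest[path] = {"md5": md5_of(path), "size": os.path.getsize(path)}
    if manifest_path:
        # Re-write the manifest so new files are cached for the next run.
        with open(manifest_path, "w") as outfile:
            json.dump(manifest, outfile, indent=2)
    return manifest
```

In this sketch a cached entry is reused even if the file has since changed on disk, which is the price of skipping the hash; how the DUM client treats modified files is not covered here.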

Understanding Batch Processing

The batch size utilized by Steps 3 and 4 can be configured within the INI config provided to the DUM client via the batch_size setting in the [OTHER] section. The number of batches processed in parallel can be controlled via the --num-threads command-line argument.
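To check which batch size a given config file would supply, the setting can be read with Python's standard configparser module. The sketch assumes only what is stated above (a batch_size key in the [OTHER] section); the function name read_batch_size and its fallback behavior are illustrative.

```python
import configparser


def read_batch_size(config_path, default=None):
    """Return batch_size from the [OTHER] section, or default if it is absent."""
    parser = configparser.ConfigParser()
    parser.read(config_path)
    # fallback covers both a missing [OTHER] section and a missing key.
    return parser.getint("OTHER", "batch_size", fallback=default)
```

For a config whose [OTHER] section contains batch_size = 250, read_batch_size would return the integer 250.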

During execution, the client splits the input file set into batches and processes them in parallel. Log messages clearly indicate this batching behavior:

INFO MainThread main : Using batch size of 250
INFO MainThread main : Request (500 files) split into 2 batches

Each batch is then prepared and uploaded independently. If individual files within a batch fail to upload (due to network issues, permission errors, etc.), those files are tracked separately and retried at the end of the ingress process. The remaining files in the batch continue to upload normally - a single file failure does not cause the entire batch to fail.
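The batching and per-file failure tracking described above can be approximated with the sketch below. It is a simplified model, not the client's code: split_into_batches, process_batch, and process_all are hypothetical names, and upload stands in for whatever callable performs a single file transfer.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_batches(paths, batch_size):
    """Split paths into consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]


def process_batch(batch, upload):
    """Upload each file in a batch; a failure does not stop the others."""
    failed = []
    for path in batch:
        try:
            upload(path)
        except Exception:
            failed.append(path)  # tracked separately for a later retry
    return failed


def process_all(paths, batch_size, num_threads, upload):
    """Process batches in parallel and return every path that failed."""
    batches = split_into_batches(paths, batch_size)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(lambda b: process_batch(b, upload), batches))
    return [path for failed in results for path in failed]


files = [f"file_{i}.xml" for i in range(500)]
print(f"Request ({len(files)} files) split into {len(split_into_batches(files, 250))} batches")
# Request (500 files) split into 2 batches
```

The list returned by process_all corresponds to the files that would be retried at the end of the ingress process.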

Note

The batching described here applies to the upload phase (Steps 3-4). Pre-flight validation checks (such as the gzip validation for weblog uploads) are performed on the entire input file set before batching occurs. If pre-flight validation fails, no files will be uploaded.

By default, at completion of an ingress request (Step 5), the DUM client provides a summary of the results of the transfer:

Ingress Summary Report for 2025-02-25 11:41:29.507022
-----------------------------------------------------
Uploaded: 200 file(s)
Skipped: 0 file(s)
Failed: 0 file(s)
Unprocessed: 0 file(s)
Total: 200 files(s)
Time elapsed: 3019.00 seconds
Bytes transferred: 3087368895

A more detailed JSON-format report, containing full listings of all uploaded/skipped/failed paths, can be written to disk via the --report-path command-line argument:

{
    "Arguments": "Namespace(config_path='mcp.test.ingress.config.ini', node='sbn', prefix='/PDS/SBN/', force_overwrite=True, num_threads=4, log_path='/tmp/dum_log.txt', manifest_path='/tmp/dum_manifest.json', report_path='/tmp/dum_report.json', dry_run=False, log_level='info', ingress_paths=['/PDS/SBN/gbo.ast.catalina.survey/'])",
    "Batch Size": 3,
    "Total Batches": 67,
    "Start Time": "2025-02-25 18:51:10.507562+00:00",
    "Finish Time": "2025-02-25 19:41:29.504806+00:00",
    "Uploaded": [
        "gbo.ast.catalina.survey/data_calibrated/703/2020/20Apr02/703_20200402_2B_F48FC1_01_0001.arch.fz",
        ...
        "gbo.ast.catalina.survey/data_calibrated/703/2020/20Apr02/703_20200402_2B_N02055_01_0001.arch.xml"
    ],
    "Total Uploaded": 200,
    "Skipped": [],
    "Total Skipped": 0,
    "Failed": [],
    "Total Failed": 0,
    "Unprocessed": [],
    "Total Unprocessed": 0,
    "Bytes Transferred": 3087368895,
    "Total Files": 200
}
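Because the report is plain JSON, it can be inspected from a script after a run. The following sketch relies only on the keys visible in the example report above; the function name summarize_report and the shape of its return value are illustrative choices.

```python
import json


def summarize_report(report_path):
    """Load a DUM JSON report and return its counts plus the failed paths."""
    with open(report_path) as infile:
        report = json.load(infile)
    return {
        "uploaded": report["Total Uploaded"],
        "skipped": report["Total Skipped"],
        "failed": report["Total Failed"],
        "unprocessed": report["Total Unprocessed"],
        "failed_paths": report["Failed"],
    }
```

A non-empty failed_paths list is a convenient starting point when deciding which files to resubmit or escalate.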

Lastly, a detailed log file containing trace statements for each file/batch uploaded can be written to disk via the --log-path command-line argument. The log file path may also be specified within the INI config.

Automatic Retry of Failed Uploads

The DUM client script is configured to automatically retry any failed uploads to S3 using exponential backoff. When an intermittent failure occurs during upload, messages pertaining to the backoff and retry are logged to the log file (which can be specified via the --log-path argument).

Here is an example of such log messages:

...
[2025-09-23 16:21:24,491] INFO Thread-9 (worker) ingress_file_to_s3 : Batch 0 : Ingesting mflat.703.19Dec20.fits.xml to https://pds-staging.s3.amazonaws.com/mflat.703.19Dec20.fits.xml
[2025-09-23 16:21:24,493] WARNING Thread-9 (worker) backoff_handler : Backing off ingress_file_to_s3() for 0.2 seconds after 1 tries, reason: HTTPError
[2025-09-23 16:21:24,665] INFO Thread-9 (worker) ingress_file_to_s3 : Batch 0 : Ingesting mflat.703.19Dec20.fits.xml to https://pds-staging.s3.amazonaws.com/mflat.703.19Dec20.fits.xml
[2025-09-23 16:21:24,667] WARNING Thread-9 (worker) backoff_handler : Backing off ingress_file_to_s3() for 1.2 seconds after 2 tries, reason: HTTPError
[2025-09-23 16:21:25,832] INFO Thread-9 (worker) ingress_file_to_s3 : Batch 0 : Ingesting mflat.703.19Dec20.fits.xml to https://pds-staging.s3.amazonaws.com/mflat.703.19Dec20.fits.xml
[2025-09-23 16:21:25,833] WARNING Thread-9 (worker) backoff_handler : Backing off ingress_file_to_s3() for 1.8 seconds after 3 tries, reason: HTTPError
[2025-09-23 16:21:27,644] INFO Thread-9 (worker) ingress_file_to_s3 : Batch 0 : Ingesting mflat.703.19Dec20.fits.xml to https://pds-staging.s3.amazonaws.com/mflat.703.19Dec20.fits.xml
[2025-09-23 16:21:27,720] INFO Thread-9 (worker) ingress_file_to_s3 : Batch 0 : mflat.703.19Dec20.fits.xml Ingest complete

Typically, log messages pertaining to backoff and retry can be safely ignored if the upload is eventually successful, as in the above example. However, if an upload ultimately fails after all retries are exhausted, it could indicate a more serious problem that needs to be investigated:

...
[2025-09-23 16:31:47,231] WARNING Thread-9 (worker) backoff_handler : Backing off ingress_file_to_s3() for 30.9 seconds after 6 tries, reason: HTTPError
[2025-09-23 16:32:18,099] INFO Thread-9 (worker) ingress_file_to_s3 : Batch 0 : Ingesting mflat.703.19Dec20.fits.fz to https://pds-staging.s3.amazonaws.com/mflat.703.19Dec20.fits.fz
[2025-09-23 16:32:18,101] WARNING Thread-9 (worker) backoff_handler : Backing off ingress_file_to_s3() for 23.2 seconds after 7 tries, reason: HTTPError
[2025-09-23 16:32:41,324] INFO Thread-9 (worker) ingress_file_to_s3 : Batch 0 : Ingesting mflat.703.19Dec20.fits.fz to https://pds-staging.s3.amazonaws.com/mflat.703.19Dec20.fits.fz
[2025-09-23 16:32:41,326] WARNING Thread-9 (worker) backoff_handler : Backing off ingress_file_to_s3() for 54.8 seconds after 8 tries, reason: HTTPError
[2025-09-23 16:33:36,086] INFO Thread-9 (worker) ingress_file_to_s3 : Batch 0 : Ingesting mflat.703.19Dec20.fits.fz to https://pds-staging.s3.amazonaws.com/mflat.703.19Dec20.fits.fz
[2025-09-23 16:33:36,087] ERROR Thread-9 (worker) _process_batch : Batch 0 : Ingress failed for mflat.703.19Dec20.fits.fz, Reason: 403 Client Error

Any files that fail to upload after all retries are exhausted are retried one final time at the end of DUM client execution:

...
[2025-09-23 16:33:36,094] INFO MainThread main : All batches processed
[2025-09-23 16:33:36,094] INFO MainThread main : ----------------------------------------
[2025-09-23 16:33:36,094] INFO MainThread main : Reattempting ingress for failed files...
[2025-09-23 16:33:36,096] INFO Thread-16 (worker) _prepare_batch_for_ingress : Batch 0 : Preparing for ingress
[2025-09-23 16:33:36,096] INFO Thread-16 (worker) _prepare_batch_for_ingress : Batch 0 : Prep completed in 0.00 seconds
[2025-09-23 16:33:36,108] INFO Thread-23 (worker) request_batch_for_ingress : Batch 0 : Requesting ingress
[2025-09-23 16:33:36,732] INFO Thread-23 (worker) request_batch_for_ingress : Batch 0 : Ingress request completed in 0.62 seconds
[2025-09-23 16:33:36,734] INFO Thread-23 (worker) ingress_file_to_s3 : Batch 0 : Ingesting mflat.703.19Dec20.fits.fz to https://pds-sbn-staging-dev.s3.amazonaws.com/mflat.703.19Dec20.fits.fz

Files that still fail to upload during this final attempt are recorded in the final summary report:

Ingress Summary Report for 2025-09-23 16:35:37.532468
-----------------------------------------------------
Uploaded: 0 file(s)
Skipped: 0 file(s)
Failed: 1 file(s)
Unprocessed: 0 file(s)
Total: 1 files(s)
Time elapsed: 244.87 seconds
Bytes transferred: 0

Should persistent failures like this occur, they should be communicated to the PDS Operations team for investigation.
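For readers unfamiliar with the pattern, the backoff-and-retry behavior seen in the logs above can be sketched generically in Python. This is not the DUM client's retry code: the function name retry_with_backoff, the default parameter values, and the use of randomized ("full jitter") waits are all assumptions chosen for illustration.

```python
import random
import time


def retry_with_backoff(func, max_tries=8, base=0.5, factor=2.0,
                       max_wait=60.0, sleep=time.sleep):
    """Call func until it succeeds or max_tries attempts have been made."""
    for attempt in range(1, max_tries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_tries:
                raise  # retries exhausted: surface the last error
            # The upper bound grows exponentially with each attempt; the
            # actual wait is drawn at random below that bound.
            upper = min(max_wait, base * factor ** (attempt - 1))
            sleep(random.uniform(0, upper))
```

Randomizing each wait keeps parallel workers from retrying in lockstep, which is one reason successive waits need not increase strictly from one try to the next.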