🏃 Usage
This section describes how to use the PDS DUM client script.
pds-ingress-client Description
Upon installation of the DUM application, the pds-ingress-client script will
be available for use on the command-line.
To see the usage documentation, you can run the following:
$ pds-ingress-client --help
Command-Line Arguments Reference
Client-side script used to perform ingress requests to the DUM service in AWS.
usage: pds-ingress-client [-h] [-c CONFIG_PATH]
-n {atm,eng,geo,img,naif,ppi,rs,rms,sbn}
[--prefix PREFIX] [--weblogs LOG_TYPE]
[--force-overwrite] [--include PATTERN]
[--exclude PATTERN] [--num-threads NUM_THREADS]
[--log-path LOG_PATH]
[--manifest-path MANIFEST_PATH]
[--report-path REPORT_PATH] [--dry-run]
[--skip-symlinks]
[--log-level {warn,warning,info,debug}] [--version]
file_or_dir [file_or_dir ...]
Positional Arguments
- file_or_dir
One or more paths to the files to ingest to S3. For each directory path provided, this script will automatically derive all sub-paths for inclusion with the ingress request.
Named Arguments
- -c, --config-path
Path to the INI config for use with this client. If not provided, the default config (/github/workspace/src/pds/ingress/conf.default.ini) is used.
- -n, --node
Possible choices: atm, eng, geo, img, naif, ppi, rs, rms, sbn
PDS node identifier of the ingress requestor. This value is used by the Ingress service to derive the S3 upload location. Argument is case-insensitive.
- --prefix, -p
Specify a path prefix to be trimmed from each resolved ingest path such that it is not included with the request to the Ingress Service. For example, specifying --prefix "/home/user" would modify paths such as "/home/user/bundle/file.xml" to just "bundle/file.xml". This can be useful for controlling which parts of a directory structure should be included with the S3 upload location returned by the Ingress Service.
- --weblogs, -w
Denotes the upload request as being for LOG_TYPE web logs. All uploaded files will be routed to a special S3 location reserved for web log files. LOG_TYPE denotes the type of web logs being uploaded, and becomes part of the destination upload path. If provided, --prefix must be provided as well.
- --force-overwrite, -f
By default, the DUM service determines if a given file has already been ingested to the PDS Cloud and has not changed. If so, ingress of the file is skipped. Use this flag to override this behavior and forcefully overwrite any existing versions of files within the PDS Cloud.
Default: False
- --include, -i
Specify a file path pattern to match against when determining which files should be included with an Ingress request. Unix-style wildcard patterns are supported. Include patterns are always applied prior to any Exclude patterns. This argument can be specified multiple times to configure multiple include patterns. Include patterns are evaluated in the order they are provided.
Default: []
- --exclude, -e
Specify a file path pattern to match against when determining which files should be excluded from an Ingress request. Unix-style wildcard patterns are supported. Exclude patterns are always applied after any Include patterns. This argument can be specified multiple times to configure multiple exclude patterns. Exclude patterns are evaluated in the order they are provided.
Default: []
- --num-threads, -t
Specify the number of threads to use when uploading files to S3 in parallel. By default, all available cores are used.
Default: 4
- --log-path
Specify a file path to write logging statements to. These will include some of the messages logged to the console, as well as additional messages about the status of each file/batch transfer. By default, the log file is created in a temporary location if this parameter is not provided. If provided, this argument takes precedence over what is provided for OTHER.log_file_path in the INI config.
- --manifest-path
Specify a file path to a JSON manifest of all files indexed for inclusion in the current ingress request. If the provided path is not an existing file, then the manifest will be written to that location. If the path already exists, this script will read the manifest, and skip checksum generation for any paths that are already specified. If not provided, no manifest is written or read.
- --report-path, -r
Specify a path to write a JSON summary report containing the full listing of all files ingressed, skipped or failed. By default, no report is created.
- --dry-run
Derive the full set of ingress paths without performing any submission requests to the server.
Default: False
- --skip-symlinks
Do not follow symbolic links when resolving ingress paths. Use this option to avoid uploading duplicate data when files are symlinked into multiple locations.
Default: False
- --log-level, -l
Possible choices: warn, warning, info, debug
Sets the logging level for logged messages. If not provided, the logging level set in the INI config is used instead.
- --version
Print the Data Upload Manager release version and exit.
Usage Notes
The client application has only two arguments which must be provided on each invocation:
at least one path to files or directories to be uploaded, and the name of the PDS
node the submission is on behalf of (via the --node argument).
When specifying the list of paths to be uploaded, any paths corresponding to
directories will be automatically recursed by the client script, and each file
within the directory will be included in the set uploaded to PDS Cloud. Any
sub-directories will be similarly recursed to find any additional files within,
until the entire directory tree is traversed. By default, symbolic links are
followed during path resolution, meaning that symlinked files and directories
will be included in the upload set. To prevent uploading duplicate data when
files are symlinked into multiple locations within a data delivery structure,
use the --skip-symlinks flag to skip symbolic links during traversal.
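The traversal described above can be approximated with a short sketch. This is not the client's actual implementation; the function name resolve_ingress_paths and its skip_symlinks parameter are hypothetical and only mirror the documented behavior of the --skip-symlinks flag.

```python
import os


def resolve_ingress_paths(paths, skip_symlinks=False):
    """Expand files and directories into a sorted list of file paths.

    Hypothetical helper: mirrors the documented traversal, not the real client code.
    """
    resolved = []
    for path in paths:
        if skip_symlinks and os.path.islink(path):
            continue
        if os.path.isfile(path):
            # Plain file paths are taken as-is
            resolved.append(path)
        elif os.path.isdir(path):
            # followlinks decides whether symlinked directories are descended into
            for root, _dirs, files in os.walk(path, followlinks=not skip_symlinks):
                for name in files:
                    full_path = os.path.join(root, name)
                    if skip_symlinks and os.path.islink(full_path):
                        continue
                    resolved.append(full_path)
    return sorted(resolved)
```

With skip_symlinks left at False, a file that is symlinked into two locations appears twice in the result, which is the duplicate-upload case the flag is meant to avoid.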
The client script provides --include and --exclude arguments which can
be used to filter the set of files included in the upload request. Both arguments
support Unix shell-style wildcards (e.g. *.xml), and can be specified multiple
times to include or exclude multiple patterns. The --include argument is applied
first, followed by the --exclude argument. If no --include arguments are
provided, all files are included by default.
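The ordering of include and exclude filtering can be illustrated with Python's standard fnmatch module. This is a minimal sketch under assumptions: the helper filter_paths is hypothetical, and matching patterns against the full path string (rather than, say, only the file name) is an assumption about the client's behavior.

```python
import fnmatch


def filter_paths(paths, includes=(), excludes=()):
    """Apply include patterns first, then exclude patterns (hypothetical helper)."""
    if includes:
        # Keep only paths matching at least one include pattern
        selected = [
            p for p in paths
            if any(fnmatch.fnmatchcase(p, pattern) for pattern in includes)
        ]
    else:
        # No include patterns: everything is included by default
        selected = list(paths)
    # Drop any remaining path that matches an exclude pattern
    return [
        p for p in selected
        if not any(fnmatch.fnmatchcase(p, pattern) for pattern in excludes)
    ]
```

For instance, filter_paths(paths, includes=["*.xml"], excludes=["*draft*"]) keeps the XML files and then removes any of them whose path contains "draft".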
Specifying the node ID of the requestor is accomplished via the --node argument,
specifying one of the following node name values: atm, eng, geo, img, naif, ppi, rs, rms, sbn.
The --dry-run flag can be used to have pds-ingress-client determine the
full set of files and directories to be processed without actually submitting
anything for ingest to PDS Cloud. This feature can be useful to ensure the correct
set of files are being included for a request before performing any communication
with the Server side of DUM.
The details of where the client script should submit ingest requests to are configured
within an INI file. After installing DUM, a new INI config should be created with the
appropriate values for each field. Once available, the config for use with a request
can be specified via the --config-path argument. See the Client Configuration section
of the installation instructions for more details on creating the INI config.
In its default configuration, the pds-ingress-client script includes the full path
of each discovered file when sending the request to the Server side components. Typically,
this can include unwanted path components, such as a user’s home directory. To control
the path components included, the --prefix argument can be used to specify a path
prefix that will be trimmed from all file paths discovered by pds-ingress-client.
For example, if the file /home/user/bundle/file.xml were to be uploaded, and
--prefix /home/user were also provided, the path provided to the ingress service
in AWS would resolve to bundle/file.xml.
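The effect of --prefix on a single path can be sketched as follows. The function trim_prefix is hypothetical and only reproduces the documented example; how the real client treats paths that do not start with the prefix is an assumption here (they are returned unchanged).

```python
def trim_prefix(path, prefix):
    """Remove a leading directory prefix from a path (hypothetical helper)."""
    if not prefix:
        return path
    # Normalize so "/home/user" and "/home/user/" behave the same, and so
    # "/home/user" does not accidentally match "/home/username"
    normalized = prefix.rstrip("/") + "/"
    if path.startswith(normalized):
        return path[len(normalized):]
    # Assumption: paths outside the prefix are left untouched
    return path


print(trim_prefix("/home/user/bundle/file.xml", "/home/user"))
```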
The --weblogs LOG_TYPE argument can be used to indicate that weblog files are being uploaded.
The type of log files (e.g. apache, nginx, etc.) should be specified as the
LOG_TYPE argument. When this flag is provided, the client will automatically adjust
the “trimmed” path for each ingested file to replace the specified path prefix with weblog/,
ensuring that all weblog files are routed to a specific S3 bucket used for web analytics.
Because of this, the --prefix argument must be provided when using the --weblogs argument.
Warning
All weblog files must be gzip-compressed (.gz extension). When using --weblogs,
the client validates that every file in the request has a .gz extension before
any uploads begin. If even a single non-gzipped file is present in the input file set,
the entire request will fail immediately with no files being uploaded.
For example, if you attempt to upload a directory containing both access.log.gz and
access.log.txt, the upload will abort with an error listing the non-gzipped files.
To successfully upload, either:
- Compress all files with gzip before uploading, or
- Use the --exclude argument to filter out non-gzipped files (e.g., --exclude "*.txt")
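The all-or-nothing nature of this pre-flight check can be sketched as below. The function names are hypothetical, and a case-sensitive comparison against the .gz suffix is an assumption; the real client may validate differently.

```python
def find_non_gzipped(paths):
    """Return every path that lacks a .gz extension (hypothetical helper)."""
    return [p for p in paths if not p.endswith(".gz")]


def validate_weblog_paths(paths):
    """Fail the whole request if any file is not gzip-compressed."""
    offenders = find_non_gzipped(paths)
    if offenders:
        # The check runs before any upload, so a single offender stops everything
        raise ValueError("Non-gzipped weblog files: " + ", ".join(offenders))
```

Running find_non_gzipped locally on a candidate file list before invoking the client is one way to spot files that would need compressing or excluding.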
The pds-ingress-client by default utilizes all available CPUs on the
local machine to perform parallelized ingress requests to the ingress service. The exact
number of threads can be controlled via the --num-threads argument.
The client script also provides several arguments for reporting status of an ingress request:
- --report-path: Specifies a path to write a detailed report file in JSON format, containing details about which files were uploaded, skipped or failed during transfer.
- --manifest-path: Specifies a path to write the manifest of all files included in the ingress request, including file checksums. This option may also be used to provide an existing Manifest file to bypass its recomputation during subsequent requests.
- --log-path: Specifies a path to write a trace log of the ingress request which does not go to the console. Can be useful for troubleshooting transfer failures.
Lastly, the --version option may be used to verify the version number of the installed DUM client.
Data Upload Manager Client Workflow
When utilizing the DUM Client script (pds-ingress-client), the following workflow is executed:
1. Indexing of the requested input files/paths to determine the full input file set
2. Generation of a Manifest file, containing information, including MD5 checksums, of each file to be ingested
3. Batch ingress requesting of input file set to the DUM Ingress Service in AWS
4. Batch upload of input file set to AWS S3
5. Ingress report creation
The input file set is determined in Step 1 by resolving the paths provided on the command line to the DUM client. Any directories provided are recursed to determine the full set of files within. Any file paths provided are included as-is in the input file set.
Depending on the size of the input file set, the Manifest file creation in Step 2 can become time-consuming due to the hashing of each file in the input file set. To save time, the --manifest-path command-line option should be leveraged to write the contents of the Manifest to local disk. Specifying the same path via --manifest-path on subsequent executions of the DUM client will result in a read of the existing Manifest from disk. Any files within the input set referenced within the read Manifest will reuse the precomputed values within, saving upfront time prior to start of upload to S3. The Manifest will then be re-written to the path specified by --manifest-path to include any new files encountered. In this way, a Manifest file can expand across executions of DUM to serve as a sort of cache for file information.
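The caching role of the Manifest can be illustrated with a small sketch. The JSON layout used here (a mapping from path to an "md5" and "size" entry) is an assumption made for illustration and is not the client's actual Manifest schema; md5_of and update_manifest are hypothetical names.

```python
import hashlib
import json
import os


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def update_manifest(paths, manifest_path):
    """Reuse cached entries, hash only new paths, then rewrite the manifest."""
    manifest = {}
    if os.path.isfile(manifest_path):
        with open(manifest_path) as handle:
            manifest = json.load(handle)
    for path in paths:
        if path not in manifest:
            # Only paths missing from the manifest pay the hashing cost
            manifest[path] = {"md5": md5_of(path), "size": os.path.getsize(path)}
    with open(manifest_path, "w") as handle:
        json.dump(manifest, handle, indent=2)
    return manifest
```

In this sketch an entry already present in the manifest is trusted as-is, which is what makes repeated runs faster.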
Understanding Batch Processing
The batch size utilized by Steps 3 and 4 can be configured within the INI config provided to the
DUM client via the batch_size setting in the [OTHER] section. The number of batches processed
in parallel can be controlled via the --num-threads command-line argument.
During execution, the client splits the input file set into batches and processes them in parallel. Log messages clearly indicate this batching behavior:
INFO MainThread main : Using batch size of 250
INFO MainThread main : Request (500 files) split into 2 batches
Each batch is then prepared and uploaded independently. If individual files within a batch fail to upload (due to network issues, permission errors, etc.), those files are tracked separately and retried at the end of the ingress process. The remaining files in the batch continue to upload normally - a single file failure does not cause the entire batch to fail.
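A rough sketch of this batching behavior follows. It is illustrative only: split_into_batches, process_batch, and process_all are hypothetical names, and upload stands in for whatever callable performs a single file transfer.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def process_batch(batch, upload):
    """Upload each file in a batch; a failure is recorded, not propagated."""
    failed = []
    for path in batch:
        try:
            upload(path)
        except Exception:
            # One failed file does not stop the rest of the batch
            failed.append(path)
    return failed


def process_all(paths, batch_size, num_threads, upload):
    """Process batches in parallel and return all files that failed."""
    batches = split_into_batches(paths, batch_size)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(lambda b: process_batch(b, upload), batches))
    return [path for failed in results for path in failed]
```

With 500 paths and a batch size of 250, split_into_batches yields the two batches reported in the log messages above; the list returned by process_all corresponds to the files that would be retried at the end.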
Note
The batching described here applies to the upload phase (Steps 3-4). Pre-flight validation checks (such as the gzip validation for weblog uploads) are performed on the entire input file set before batching occurs. If pre-flight validation fails, no files will be uploaded.
By default, at completion of an ingress request (Step 5), the DUM client provides a summary of the results of the transfer:
Ingress Summary Report for 2025-02-25 11:41:29.507022
-----------------------------------------------------
Uploaded: 200 file(s)
Skipped: 0 file(s)
Failed: 0 file(s)
Unprocessed: 0 file(s)
Total: 200 files(s)
Time elapsed: 3019.00 seconds
Bytes transferred: 3087368895
A more detailed JSON-format report, containing full listings of all uploaded/skipped/failed paths, can be written to disk via the --report-path command-line argument:
{
"Arguments": "Namespace(config_path='mcp.test.ingress.config.ini', node='sbn', prefix='/PDS/SBN/', force_overwrite=True, num_threads=4, log_path='/tmp/dum_log.txt', manifest_path='/tmp/dum_manifest.json', report_path='/tmp/dum_report.json', dry_run=False, log_level='info', ingress_paths=['/PDS/SBN/gbo.ast.catalina.survey/'])",
"Batch Size": 3,
"Total Batches": 67,
"Start Time": "2025-02-25 18:51:10.507562+00:00",
"Finish Time": "2025-02-25 19:41:29.504806+00:00",
"Uploaded": [
"gbo.ast.catalina.survey/data_calibrated/703/2020/20Apr02/703_20200402_2B_F48FC1_01_0001.arch.fz",
...
"gbo.ast.catalina.survey/data_calibrated/703/2020/20Apr02/703_20200402_2B_N02055_01_0001.arch.xml"
],
"Total Uploaded": 200,
"Skipped": [],
"Total Skipped": 0,
"Failed": [],
"Total Failed": 0,
"Unprocessed": [],
"Total Unprocessed": 0,
"Bytes Transferred": 3087368895,
"Total Files": 200
}
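Because the report is plain JSON, it can be inspected with a few lines of Python. The sketch below assumes only the keys visible in the example report above; load_report_summary is a hypothetical helper, not part of DUM.

```python
import json


def load_report_summary(report_path):
    """Read a DUM JSON report and return (totals, failed_paths)."""
    with open(report_path) as handle:
        report = json.load(handle)
    # Key names are taken from the example report shown above
    totals = {
        key: report[key]
        for key in (
            "Total Uploaded",
            "Total Skipped",
            "Total Failed",
            "Total Unprocessed",
            "Total Files",
        )
    }
    return totals, report["Failed"]
```

A wrapper script could, for example, exit with a non-zero status whenever the returned list of failed paths is not empty.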
Lastly, a detailed log file containing trace statements for each file/batch uploaded can be written to disk via the --log-path command-line argument. The log file path may also be specified within the INI config.
Automatic Retry of Failed Uploads
The DUM client script is configured to automatically retry any failed uploads to S3 using exponential backoff and retry. When an intermittent failure occurs during upload, messages pertaining to the backoff and retry are logged to the log file (which can be specified via the --log-path argument).
Here is an example of such log messages:
...
[2025-09-23 16:21:24,491] INFO Thread-9 (worker) ingress_file_to_s3 : Batch 0 : Ingesting mflat.703.19Dec20.fits.xml to https://pds-staging.s3.amazonaws.com/mflat.703.19Dec20.fits.xml
[2025-09-23 16:21:24,493] WARNING Thread-9 (worker) backoff_handler : Backing off ingress_file_to_s3() for 0.2 seconds after 1 tries, reason: HTTPError
[2025-09-23 16:21:24,665] INFO Thread-9 (worker) ingress_file_to_s3 : Batch 0 : Ingesting mflat.703.19Dec20.fits.xml to https://pds-staging.s3.amazonaws.com/mflat.703.19Dec20.fits.xml
[2025-09-23 16:21:24,667] WARNING Thread-9 (worker) backoff_handler : Backing off ingress_file_to_s3() for 1.2 seconds after 2 tries, reason: HTTPError
[2025-09-23 16:21:25,832] INFO Thread-9 (worker) ingress_file_to_s3 : Batch 0 : Ingesting mflat.703.19Dec20.fits.xml to https://pds-staging.s3.amazonaws.com/mflat.703.19Dec20.fits.xml
[2025-09-23 16:21:25,833] WARNING Thread-9 (worker) backoff_handler : Backing off ingress_file_to_s3() for 1.8 seconds after 3 tries, reason: HTTPError
[2025-09-23 16:21:27,644] INFO Thread-9 (worker) ingress_file_to_s3 : Batch 0 : Ingesting mflat.703.19Dec20.fits.xml to https://pds-staging.s3.amazonaws.com/mflat.703.19Dec20.fits.xml
[2025-09-23 16:21:27,720] INFO Thread-9 (worker) ingress_file_to_s3 : Batch 0 : mflat.703.19Dec20.fits.xml Ingest complete
Typically, log messages pertaining to backoff and retry can be safely ignored if the upload is eventually successful, as in the above example. However, if an upload ultimately fails after all retries are exhausted, it could indicate a more serious problem that needs to be investigated:
...
[2025-09-23 16:31:47,231] WARNING Thread-9 (worker) backoff_handler : Backing off ingress_file_to_s3() for 30.9 seconds after 6 tries, reason: HTTPError
[2025-09-23 16:32:18,099] INFO Thread-9 (worker) ingress_file_to_s3 : Batch 0 : Ingesting mflat.703.19Dec20.fits.fz to https://pds-staging.s3.amazonaws.com/mflat.703.19Dec20.fits.fz
[2025-09-23 16:32:18,101] WARNING Thread-9 (worker) backoff_handler : Backing off ingress_file_to_s3() for 23.2 seconds after 7 tries, reason: HTTPError
[2025-09-23 16:32:41,324] INFO Thread-9 (worker) ingress_file_to_s3 : Batch 0 : Ingesting mflat.703.19Dec20.fits.fz to https://pds-staging.s3.amazonaws.com/mflat.703.19Dec20.fits.fz
[2025-09-23 16:32:41,326] WARNING Thread-9 (worker) backoff_handler : Backing off ingress_file_to_s3() for 54.8 seconds after 8 tries, reason: HTTPError
[2025-09-23 16:33:36,086] INFO Thread-9 (worker) ingress_file_to_s3 : Batch 0 : Ingesting mflat.703.19Dec20.fits.fz to https://pds-staging.s3.amazonaws.com/mflat.703.19Dec20.fits.fz
[2025-09-23 16:33:36,087] ERROR Thread-9 (worker) _process_batch : Batch 0 : Ingress failed for mflat.703.19Dec20.fits.fz, Reason: 403 Client Error
Any files that fail to upload after all retries are exhausted are reattempted in one final attempt at the end of DUM client execution:
...
[2025-09-23 16:33:36,094] INFO MainThread main : All batches processed
[2025-09-23 16:33:36,094] INFO MainThread main : ----------------------------------------
[2025-09-23 16:33:36,094] INFO MainThread main : Reattempting ingress for failed files...
[2025-09-23 16:33:36,096] INFO Thread-16 (worker) _prepare_batch_for_ingress : Batch 0 : Preparing for ingress
[2025-09-23 16:33:36,096] INFO Thread-16 (worker) _prepare_batch_for_ingress : Batch 0 : Prep completed in 0.00 seconds
[2025-09-23 16:33:36,108] INFO Thread-23 (worker) request_batch_for_ingress : Batch 0 : Requesting ingress
[2025-09-23 16:33:36,732] INFO Thread-23 (worker) request_batch_for_ingress : Batch 0 : Ingress request completed in 0.62 seconds
[2025-09-23 16:33:36,734] INFO Thread-23 (worker) ingress_file_to_s3 : Batch 0 : Ingesting mflat.703.19Dec20.fits.fz to https://pds-sbn-staging-dev.s3.amazonaws.com/mflat.703.19Dec20.fits.fz
Files that still fail to upload during this final attempt are recorded in the final summary report:
Ingress Summary Report for 2025-09-23 16:35:37.532468
-----------------------------------------------------
Uploaded: 0 file(s)
Skipped: 0 file(s)
Failed: 1 file(s)
Unprocessed: 0 file(s)
Total: 1 files(s)
Time elapsed: 244.87 seconds
Bytes transferred: 0
Should persistent failures like this occur, they should be communicated to the PDS Operations team for investigation.
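For readers unfamiliar with the technique, exponential backoff with retry can be sketched in plain Python as shown below. This is a generic illustration, not the DUM client's code: the function name, the retry limit, the base delay, and the use of random jitter are all assumptions chosen for the example.

```python
import random
import time


def retry_with_backoff(func, max_tries=8, base=1.0, factor=2.0,
                       max_wait=60.0, sleep=time.sleep):
    """Call func until it succeeds or max_tries attempts have been made."""
    for attempt in range(1, max_tries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_tries:
                # Retries exhausted: surface the last error to the caller
                raise
            # The upper bound grows exponentially with each failed attempt;
            # a random wait below that bound spreads out competing retries
            upper_bound = min(max_wait, base * factor ** (attempt - 1))
            sleep(random.uniform(0, upper_bound))
```

The sleep parameter exists so the wait can be replaced in tests; in normal use the default time.sleep applies.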