This guide explains how to collect and submit documents in a way that preserves forensic value and improves detection reliability.
The main rule: submit the original file bytes
Submit the raw original file as received from the end user (or downstream customer). Avoid intake paths that re-save, re-encode, or reconstruct the document before analysis.
Do not submit files that were pre-processed by third-party tools unless you have no alternative.
Why this matters
Many systems silently modify documents (resizing, JPEG re-compression, stripping metadata, PDF reconstruction). These changes can reduce or eliminate forensic signals.
Common culprits:
- email gateways
- document management systems
- file sharing tools
- messaging apps (WhatsApp/Slack)
- “optimize” or “compress” functions
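One way to check whether an intake path has altered a file is to fingerprint the bytes when they are received and again just before submission. The sketch below uses SHA-256 from Python's standard library. The function names are hypothetical and not part of any documented API.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_unmodified(received: bytes, about_to_submit: bytes) -> bool:
    """True only if the bytes about to be submitted match what was received.

    Hypothetical helper: any re-save, re-compression, or metadata strip
    in between changes the digest.
    """
    return sha256_hex(received) == sha256_hex(about_to_submit)
```

If the digests differ, some component between intake and submission rewrote the file, and that component is a candidate for the list above.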
Practical guidance
| Do | Don’t |
|---|---|
| Collect the file via direct upload in your user flow | Route via messaging apps (often re-encodes images) |
| Store and forward the original bytes unchanged | Re-save documents in editors “to standardize” them |
| Upload binary bytes to the presigned URL | Base64-encode and re-wrap the file payload |
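The last row of the table can be sketched as follows: read the file in binary mode and send those bytes as the request body, with no Base64 step and no JSON wrapper. This is a minimal sketch using only the standard library. The HTTP method (`PUT`), the `Content-Type` value, and the function names are assumptions, so confirm what your presigned URL expects in the API reference.

```python
import urllib.request
from pathlib import Path


def build_upload_request(path: str, presigned_url: str) -> urllib.request.Request:
    """Build a request whose body is the file's raw bytes, unchanged."""
    raw = Path(path).read_bytes()  # binary read: no decoding, no re-encoding
    return urllib.request.Request(
        presigned_url,
        data=raw,  # the original bytes, not a Base64 string
        method="PUT",  # assumption: the presigned URL accepts PUT
        headers={"Content-Type": "application/octet-stream"},  # assumption
    )


def upload(path: str, presigned_url: str) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(build_upload_request(path, presigned_url)) as resp:
        return resp.status
```

Building the request separately from sending it makes it easy to assert in a test that the body equals the bytes on disk.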
PDFs (best for automation and accuracy)
Digitally issued PDFs (e.g., bank statements, utility bills, forms) generally enable stronger detection and automation and are less sensitive to quality loss from re-encoding.
Images (more sensitive to quality)
Images require more care:
- authenticity is harder to determine
- pre-digital tampering (before scanning) may not be detectable without content analysis
- lower image quality increases false positives
Redaction guidance
Redaction and annotation create a new, modified document, which can reduce forensic value.
Key impacts:
- may change document structure/metadata and obscure evidence
- invalidates digital signatures/hashes used for authenticity verification
- some tools reconstruct PDFs in ways that increase false positives
If possible, analyze originals first and apply redaction only after analysis. If you must redact before analysis, keep the original stored separately for audit/debug (if allowed by your policy).
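If your policy allows retaining originals, one approach is to archive the untouched bytes before any redaction tool opens the file. The sketch below stores each original under its SHA-256 digest and never overwrites an existing entry. The function name and directory layout are hypothetical.

```python
import hashlib
from pathlib import Path


def archive_original(data: bytes, archive_dir: str) -> Path:
    """Store the original bytes under their SHA-256 digest and return the path.

    Hypothetical helper: an existing archive entry is left untouched,
    so a later redacted copy cannot replace the original.
    """
    digest = hashlib.sha256(data).hexdigest()
    target = Path(archive_dir) / digest
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
    return target
```

The redacted copy then lives elsewhere, and the archived path can be recorded alongside the submission for audit or debugging.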
See Supported formats.
Go-live checklist
Before you go live, validate:
- You can capture and store original bytes end-to-end.
- Your pipeline does not re-encode, resize, reconstruct, or “optimize” documents.
- You have a strategy for low-quality image handling (re-request vs manual review).
- Your logs contain only non-PII identifiers (e.g., submission_id, safe query_id, timestamps, stage/cell).
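The logging item can be enforced with an allowlist, so that a field is dropped unless it has been explicitly approved. This is a minimal sketch. The field names mirror the examples above, but the exact set and the function name are assumptions to adapt to your own schema.

```python
# Assumed allowlist, based on the example identifiers in the checklist.
ALLOWED_LOG_FIELDS = {"submission_id", "query_id", "timestamp", "stage"}


def safe_log_record(event: dict) -> dict:
    """Keep only allowlisted keys; everything else is dropped before logging."""
    return {key: value for key, value in event.items() if key in ALLOWED_LOG_FIELDS}
```

An allowlist fails closed: a newly added field such as a filename or customer name stays out of the logs until someone approves it.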