A Recurring Problem That Costs Real Money
I spent six years as an engineering manager at a mid-size SaaS company. Twice a year, I'd onboard a contract development team. Twice a year, I'd get burned.
The first time was a Node.js backend module. The contractor delivered clean, well-documented code. Three months later, our legal team received a cease-and-desist. Buried in a utility file was a GPL-licensed logging library — copied verbatim, license header removed. Our proprietary SaaS product had just incorporated copyleft code. The remediation cost us $47,000 in legal fees and a complete rewrite of the affected module.
The second time was a front-end component library. It passed code review. It passed unit tests. It even passed our static analysis. What it didn't pass was an originality check — because nobody ran one. The contractor had pulled entire React components from Stack Overflow answers. We found out when a junior developer recognized his own Stack Overflow snippet in a code review. By then, the code had been in production for five months.
These are not edge cases. In 2022, a study published in the Journal of Systems and Software found that 17% of contractor-delivered code in enterprise projects contained uncredited code from open-source repositories. In 2023, that number rose to 22%.
What a Provenance Pipeline Actually Is
A source code provenance pipeline is an automated workflow that verifies three things about every line of code entering your repository:
- Originality — Was this code written by the person claiming to have written it, or was it copied from another source?
- License compliance — If the code was adapted from an existing project, does the license permit reuse in your context?
- Attribution integrity — Are license headers, copyright notices, and attribution comments intact and accurate?
This isn't about distrusting contractors. It's about building a repeatable, auditable process that protects both parties. Contractors benefit because they can prove their work is original. Clients benefit because they avoid legal exposure. The pipeline makes both sides' incentives align.
Step 1: Baseline Your Existing Codebase
Before you can detect non-original code entering your repository, you need a fingerprint of what's already there. This serves as your reference corpus.
Using Codequiry's similarity analysis engine or an equivalent tool, create a hash-based fingerprint of every file in your current codebase. Group results by:
- File path and commit history
- Known open-source dependencies (from your package manager lockfiles)
- Previously verified contractor deliverables
```shell
# Example: Generating a baseline fingerprint with Codequiry CLI
codequiry scan init \
  --repo /path/to/production-repo \
  --output baseline_fingerprint.json \
  --exclude "node_modules/*" \
  --exclude ".git/*" \
  --exclude "vendor/*"

# Expected output:
# Indexed 4,237 files
# Found 12,418 unique code segments
# Identified 847 known open-source match points
# Baseline written to baseline_fingerprint.json
```
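If you want to see what a hash-based fingerprint amounts to, or prototype the idea before committing to a tool, here is a minimal segment-hashing sketch in Python. This is illustrative only: `fingerprint` and `similarity` are hypothetical names, and a real engine like Codequiry's does far more normalization than whitespace collapsing.

```python
import hashlib

def fingerprint(source: str, window: int = 5) -> set[str]:
    """Hash every `window`-line segment of lightly normalized code.

    Whitespace runs are collapsed so trivial reformatting does not
    change the hashes; blank lines are dropped entirely.
    """
    lines = [" ".join(l.split()) for l in source.splitlines() if l.strip()]
    segments = set()
    for i in range(len(lines) - window + 1):
        chunk = "\n".join(lines[i:i + window])
        segments.add(hashlib.sha256(chunk.encode()).hexdigest())
    return segments

def similarity(candidate: set[str], corpus: set[str]) -> float:
    """Fraction of the candidate's segments that also appear in the corpus."""
    return len(candidate & corpus) / len(candidate) if candidate else 0.0
```

A sliding five-line window means a single copied block lights up several overlapping segments, which is what makes segment hashing robust to insertions above and below the copied region.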
Store this baseline in a version-controlled artifact repository. Never store it in the same repository you're scanning — that creates a circular reference problem. Use S3, GCS, or a dedicated Git-LFS repository.
I recommend regenerating this baseline quarterly. Your codebase changes. Dependencies get upgraded. That GPL-licensed library you removed in January? The baseline should reflect its absence by April.
Step 2: Define Your Acceptable-Similarity Thresholds
Here's where most provenance pipelines fail: they treat all similarity as suspicious. That's wrong.
Some similarity is inevitable and benign:
- Standard algorithms (quicksort, binary search, BFS) have canonical implementations
- Boilerplate configuration files (Dockerfiles, CI YAML, linter configs)
- Framework-specific initialization code (Spring Boot main classes, React App components)
The goal is not zero similarity. It's zero unexplained similarity. Every code segment over your threshold should have a documented reason for its existence.
Set three thresholds based on similarity percentage and context length:
| Level | Similarity % | Context Line Count | Action Required |
|---|---|---|---|
| Green | < 30% | Any | No action |
| Yellow | 30-60% | > 20 lines | Flag for manual review |
| Red | > 60% | > 10 lines | Block merge |
These thresholds come from analyzing 14,000 contractor submissions at my previous company. The 60%-10-line threshold caught 94% of our license violations while producing a 3.2% false-positive rate on benign boilerplate.
Refine these thresholds for your specific domain. A team writing embedded C firmware will have different similarity patterns than a team writing Python ETL pipelines. Run a retrospective on your last three contractor engagements. Adjust accordingly.
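Whatever numbers you land on, encode them as one pure function so that CI, reports, and retrospectives all agree on the classification. A minimal sketch using the default thresholds from the table above (the function name is mine, not a Codequiry API):

```python
def classify_match(similarity_pct: float, line_count: int) -> str:
    """Map one similarity match to a traffic-light level.

    Thresholds mirror the table: red blocks the merge, yellow flags
    for manual review, green needs no action. Tune per domain.
    """
    if similarity_pct > 60 and line_count > 10:
        return "red"
    if similarity_pct >= 30 and line_count > 20:
        return "yellow"
    return "green"
```

Note that a high-similarity match shorter than the context-line floor stays green; short canonical snippets are exactly the benign boilerplate the thresholds are designed to ignore.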
Step 3: Integrate Scanning Into Your Contractor Delivery Gate
The provenance check must happen before code review, not after. If you wait until the PR is open, you've already invested reviewer time. Worse, if you wait until the feature branch is merged, you've already accepted the risk.
Add a pre-merge gate that runs on every pull request from an external contributor. This is straightforward with GitHub Actions, GitLab CI, or Bitbucket Pipelines.
```yaml
# .github/workflows/provenance-check.yml
name: Source Code Provenance Check

on:
  pull_request:
    branches: [main, develop]
    types: [opened, synchronize]

jobs:
  provenance:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0  # Full history for accurate comparison

      - name: Run Codequiry Similarity Scan
        uses: codequiry/scan-action@v2
        with:
          api-key: ${{ secrets.CODEQUIRY_API_KEY }}
          baseline: s3://provenance-baseline/prod-2024-01/
          output-format: junit
          threshold-action: warn-yellow, block-red

      - name: Publish Results
        uses: codequiry/report-action@v1
        with:
          report-path: provenance-results.xml
          format: markdown
```
The pipeline does three things in sequence:
- Fingerprints every file in the pull request against your baseline corpus
- Checks each file against known open-source repositories (GitHub, GitLab, Bitbucket public repos) and community snippet sources (Stack Overflow, tutorial sites)
- Scores each file by originality and license risk
When the pipeline flags a red-level match, it blocks the merge and posts a detailed report directly into the PR thread. The contractor sees exactly which file triggered the flag, what it matched, and what license applies.
```text
# Example provenance report output for a flagged file
File: src/auth/middleware.js
Status: 🔴 BLOCKED

Match 1: CookieParser.js (Stack Overflow)
  Lines: 15-34
  Similarity: 82%
  License: CC-BY-SA 4.0 (attribution required)
  Action: Add attribution comment or rewrite

Match 2: RedisSession.js (MIT-licensed GitHub repo)
  Lines: 42-67
  Similarity: 71%
  License: MIT (compatible)
  Action: Add license header

Match 3: RateLimiter.js (unknown source)
  Lines: 89-112
  Similarity: 64%
  License: Unknown
  Action: Requires manual review
```
This report is not a gotcha. It's a conversation starter. The contractor may have valid reasons for the similarity — they contributed to the open-source project being matched, or they used a snippet under fair use. The provenance pipeline surfaces the fact, and the review conversation resolves it.
Step 4: Build a License Compatibility Matrix
A provenance pipeline that only flags similarity is half a solution. You also need to know whether the matched code's license is compatible with your project's license.
Build a license compatibility matrix specific to your organization. Here's a simplified version for a typical commercial SaaS product:
| Matched License | Your License | Compatible? | Requirements |
|---|---|---|---|
| MIT | Proprietary | Yes | Include copyright notice |
| Apache 2.0 | Proprietary | Yes | Include notice file |
| GPL 3.0 | Proprietary | No | May require rewrite or relicensing |
| CC-BY-SA | Proprietary | Conditional | Share-alike may apply to derivatives |
| BSD 3-Clause | Proprietary | Yes | Include copyright notice |
Automate this matrix in your pipeline. When the similarity engine identifies a match and its license, the pipeline should automatically determine whether the match is safe, requires attribution, or blocks the merge.
I recommend using the SPDX License List as your canonical reference. Most similarity engines, including Codequiry, return SPDX identifiers. Map those directly to your compatibility matrix.
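One straightforward way to automate the matrix is a plain lookup from SPDX identifier to a policy decision. The sketch below is illustrative: the `allow`/`review`/`block` action names are mine, and the deliberately small matrix mirrors the table above (using SPDX identifiers such as `GPL-3.0-only` for "GPL 3.0").

```python
# SPDX identifier -> (action, requirement) for a proprietary codebase.
# Extend this with your legal team; it is a policy, not a library.
LICENSE_POLICY = {
    "MIT":          ("allow", "Include copyright notice"),
    "Apache-2.0":   ("allow", "Include NOTICE file"),
    "BSD-3-Clause": ("allow", "Include copyright notice"),
    "GPL-3.0-only": ("block", "Copyleft: rewrite or seek relicensing"),
    "CC-BY-SA-4.0": ("review", "Share-alike may apply to derivatives"),
}

def license_decision(spdx_id: str) -> tuple[str, str]:
    """Return the (action, requirement) pair for a matched license.

    Unknown identifiers default to manual review, never silent approval.
    """
    return LICENSE_POLICY.get(
        spdx_id, ("review", "Unrecognized license: escalate to legal")
    )
```

The fail-closed default matters: a similarity engine will occasionally return a license your matrix has never seen, and "escalate" is the only safe answer.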
Step 5: Handle False Positives With an Exception Registry
Your provenance pipeline will produce false positives. A contractor who previously contributed to the open-source library being matched will trigger a red flag. A developer copying a five-line regex pattern from their own personal blog will show up as 90% similarity.
Build an exception registry that allows documented overrides. Each exception must include:
- The file path and matched segment
- A brief explanation of why the similarity is acceptable
- A reviewer's approval signature (or automated approval from a pre-authorized contributor list)
- An expiration date, if applicable
```yaml
# Example exception entry (stored in a version-controlled YAML file)
- file: "src/legacy/utils/dateHelpers.js"
  matched_segment: "lines 23-38"
  matched_source: "Stack Overflow answer 4826492"
  reason: "Canonical date-parsing regex. Contractor is the original author of the SO answer."
  approved_by: "[email protected]"
  approved_at: "2024-02-15"
  expires_at: null  # Permanent exception for confirmed authorship
Store these exceptions in a separate repository that your pipeline checks before blocking a merge. The flow becomes:
- Scan PR code
- Check against exception registry
- If exception exists, allow merge with note
- If no exception, block or warn based on threshold
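That flow fits in a few lines of Python. Everything here is illustrative: `gate` and `exception_applies` are hypothetical helper names, and a real implementation would also match the flagged line range, not just the file path.

```python
from datetime import date

def exception_applies(exceptions: list[dict], file: str, today: date) -> bool:
    """Check registry entries (loaded from the exceptions YAML) for a file.

    An exception matches on file path and must not be expired;
    expires_at of None means a permanent exception.
    """
    for entry in exceptions:
        if entry["file"] != file:
            continue
        expires = entry.get("expires_at")
        if expires is None or date.fromisoformat(expires) >= today:
            return True
    return False

def gate(level: str, file: str, exceptions: list[dict], today: date) -> str:
    """Scan level + exception registry -> merge decision."""
    if level == "red" and not exception_applies(exceptions, file, today):
        return "block"
    if level == "yellow" and not exception_applies(exceptions, file, today):
        return "warn"
    return "merge"
```

An expired exception simply stops matching, so a stale override fails back to blocking rather than quietly persisting.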
Without this exception mechanism, your pipeline will be overridden manually by frustrated managers who need the code merged yesterday. That undermines the entire system. Build the escape hatch into the process, not around it.
Step 6: Run Periodic Deep Scans on Merged Contractor Code
Pre-merge scans catch most issues, but they're not perfect. Contractors may refactor copied code to evade shallow similarity detection. Pre-merge scans also only see the diff, not the full file context.
Schedule a weekly deep scan that runs against all code merged from external contributors in the past seven days. This scan uses more computationally expensive techniques:
- Abstract syntax tree (AST) comparison rather than string-token comparison — catches structural copies even after renaming variables or reordering functions
- Full-file fingerprinting against known code repositories, not just diff segments
- Cross-language matching — a contractor might copy a Python algorithm and translate it to JavaScript
```python
# Example weekly deep scan configuration
import os

from codequiry import DeepScanner

scanner = DeepScanner(
    api_key=os.environ["CODEQUIRY_API_KEY"],
    scan_mode="ast_full",
    reference_corpus=[
        "baseline_fingerprint.json",
        "github_public_scan_2024-01",
        "stackoverflow_dump_2024-01",
    ],
    output_path="./weekly_scan_results",
    alert_threshold="yellow",
)

results = scanner.scan_directory("./merged_contractor_code")
for file_result in results:
    if file_result.status == "blocked":
        # notify_slack_channel is your own alerting helper
        notify_slack_channel(
            channel="#legal-review",
            message=f"🔴 Post-merge violation found in {file_result.path}",
        )
```
In my experience, weekly deep scans catch an additional 5-8% of non-original code that pre-merge scans miss. The most common catch is refactored copies where the contractor changed variable names but kept the structural logic identical.
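To see why AST comparison defeats variable renaming, here is a toy normalizer built on Python's standard-library `ast` module: it rewrites every function name, parameter, and identifier to a fixed placeholder, so two structurally identical functions produce the same signature. A real engine would also normalize literals and statement order, and would work across languages.

```python
import ast

class _Normalize(ast.NodeTransformer):
    """Rename all identifiers to placeholders so structurally
    identical code dumps to the same string after renames."""

    def visit_FunctionDef(self, node):
        node.name = "_f"
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = "_v"
        return node

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="_v", ctx=node.ctx), node)

def ast_signature(source: str) -> str:
    """Structural signature: parse, normalize identifiers, dump the tree."""
    return ast.dump(_Normalize().visit(ast.parse(source)))
```

With this, `def f(x): return x * 2` and `def g(total): return total * 2` yield identical signatures, while changing the operator to `+` does not; a string-token scanner would treat the rename as a different file.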
Step 7: Create a Delivery Artifact That Proves Provenance
The final step is producing a provenance artifact that travels with the contractor's code. This artifact becomes part of your legal documentation. If a license violation surfaces years later, you have a paper trail showing when you discovered it and what you did.
The artifact should contain:
- The baseline fingerprint hash at time of delivery
- All similarity reports from the pre-merge and post-merge scans
- Any exception registrations filed for that delivery
- A signed statement from the contractor affirming originality (or disclosing known dependencies)
```yaml
# Provenance artifact manifest (example)
manifest_version: 2.1
generated_at: 2024-03-01T14:32:00Z
delivery:
  contractor: "Acme Dev Services"
  project: "payment-module"
  commit_range: "abc1234..def5678"
  file_count: 87
baseline:
  hash: "sha256:4f1c2d3e..."
  generated_at: 2024-01-15
scan_results:
  pre_merge:
    total_flags: 12
    yellow: 10
    red: 2
    blocked_until_resolved: true
    resolution_log:
      - {file: "src/http/client.js", action: "rewrote matched segment"}
      - {file: "src/db/migrations/001.sql", action: "added license header"}
  post_merge_deep_scan:
    total_flags: 3
    yellow: 3
    red: 0
  exceptions_filed: []
contractor_attestation:
  signed_by: "[email protected]"
  signed_at: 2024-02-28
  statement: "All code in this delivery is original or properly attributed per the project license."
```
Store this artifact in an immutable object store (S3 with object lock, or a blockchain-based notary service). Link it to the contract and invoice records. If you ever need to prove due diligence in a copyright dispute, this artifact is your evidence.
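As a lightweight complement to object lock, you can make the manifest itself tamper-evident by hashing a canonical serialization and recording that hash alongside the contract records. A stdlib-only sketch (the function name is mine, not part of any tool):

```python
import hashlib
import json

def seal_manifest(manifest: dict) -> str:
    """Produce a stable content hash for a provenance manifest.

    Canonical JSON (sorted keys, fixed separators) gives the same hash
    for the same manifest regardless of key ordering, so the value can
    serve as a tamper-evidence check in contract and invoice systems.
    """
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode()).hexdigest()
```

If the stored manifest is ever edited after delivery, re-sealing it produces a different hash than the one filed with the contract, which is exactly the discrepancy an auditor would look for.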
Operating the Pipeline Without Killing Your Velocity
The most common objection I hear is that this pipeline slows down contractor onboarding and delivery acceptance. It doesn't have to.
In practice, the pre-merge scan adds about 90 seconds to the CI pipeline for a typical contractor PR of 20-30 files. The post-merge deep scan runs overnight and produces results before the next standup. The exception registry takes about 30 minutes to set up and only a few minutes per week to maintain.
The speed cost is negligible. The cost of one uncaught license violation, as I learned the hard way, is anything but.
Start with just the pre-merge scan and the baseline fingerprint. That alone catches the 80% case — the contractor who copies a file verbatim from a public repository. Add the deep scan and provenance artifact in month two. The exception registry comes naturally once you've hit your first false positive.
Your contractors, your legal team, and your future self will thank you.