A Recurring Problem That Costs Real Money
I spent six years as an engineering manager at a mid-size SaaS company. Twice a year, I'd onboard a contract development team. Twice a year, I'd get burned.
The first time was a Node.js backend module. The contractor delivered clean, well-documented code. Three months later, our legal team received a cease-and-desist. Buried in a utility file was a GPL-licensed logging library — copied verbatim, license header removed. Our proprietary SaaS product had just incorporated copyleft code. The remediation cost us $47,000 in legal fees and a complete rewrite of the affected module.
The second time was a front-end component library. It passed code review. It passed unit tests. It even passed our static analysis. What it didn't pass was an originality check — because nobody ran one. The contractor had pulled entire React components from Stack Overflow answers. We found out when a junior developer recognized his own Stack Overflow snippet in a code review. By then, the code had been in production for five months.
These are not edge cases. In 2022, a study published in the Journal of Systems and Software found that 17% of contractor-delivered code in enterprise projects contained uncredited code from open-source repositories. In 2023, that number rose to 22%.
What a Provenance Pipeline Actually Is
A source code provenance pipeline is an automated workflow that verifies three things about every line of code entering your repository:
- Originality — Was this code written by the person claiming to have written it, or was it copied from another source?
- License compliance — If the code was adapted from an existing project, does the license permit reuse in your context?
- Attribution integrity — Are license headers, copyright notices, and attribution comments intact and accurate?
This isn't about distrusting contractors. It's about building a repeatable, auditable process that protects both parties. Contractors benefit because they can prove their work is original. Clients benefit because they avoid legal exposure. The pipeline makes both sides' incentives align.
Step 1: Baseline Your Existing Codebase
Before you can detect non-original code entering your repository, you need a fingerprint of what's already there. This serves as your reference corpus.
Using Codequiry's similarity analysis engine or an equivalent tool, create a hash-based fingerprint of every file in your current codebase. Group results by:
- File path and commit history
- Known open-source dependencies (from your package manager lockfiles)
- Previously verified contractor deliverables
```shell
# Example: Generating a baseline fingerprint with Codequiry CLI
codequiry scan init \
  --repo /path/to/production-repo \
  --output baseline_fingerprint.json \
  --exclude "node_modules/*" \
  --exclude ".git/*" \
  --exclude "vendor/*"

# Expected output:
# Indexed 4,237 files
# Found 12,418 unique code segments
# Identified 847 known open-source match points
# Baseline written to baseline_fingerprint.json
```
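If you want to see what a hash-based fingerprint amounts to, or prototype the idea before committing to a tool, here is a minimal segment-hashing sketch in Python. This is illustrative only: `fingerprint` and `similarity` are hypothetical names, and a real engine like Codequiry's does far more normalization than whitespace collapsing.

```python
import hashlib

def fingerprint(source: str, window: int = 5) -> set[str]:
    """Hash every `window`-line segment of lightly normalized code.

    Whitespace runs are collapsed so trivial reformatting does not
    change the hashes; blank lines are dropped entirely.
    """
    lines = [" ".join(l.split()) for l in source.splitlines() if l.strip()]
    segments = set()
    for i in range(len(lines) - window + 1):
        chunk = "\n".join(lines[i:i + window])
        segments.add(hashlib.sha256(chunk.encode()).hexdigest())
    return segments

def similarity(candidate: set[str], corpus: set[str]) -> float:
    """Fraction of the candidate's segments that also appear in the corpus."""
    return len(candidate & corpus) / len(candidate) if candidate else 0.0
```

A sliding five-line window means a single copied block lights up several overlapping segments, which is what makes segment hashing robust to insertions above and below the copied region.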
Store this baseline in a version-controlled artifact repository. Never store it in the same repository you're scanning — that creates a circular reference problem. Use S3, GCS, or a dedicated Git-LFS repository.
I recommend regenerating this baseline quarterly. Your codebase changes. Dependencies get upgraded. That GPL-licensed library you removed in January? The baseline should reflect its absence by April.
Step 2: Define Your Acceptable-Similarity Thresholds
Here's where most provenance pipelines fail: they treat all similarity as suspicious. That's wrong.
Some similarity is inevitable and benign:
- Standard algorithms (quicksort, binary search, BFS) have canonical implementations
- Boilerplate configuration files (Dockerfiles, CI YAML, linter configs)
- Framework-specific initialization code (Spring Boot main classes, React App components)
The goal is not zero similarity. It's zero unexplained similarity. Every code segment over your threshold should have a documented reason for its existence.
Set three thresholds based on similarity percentage and context length:
| Level | Similarity % | Context Line Count | Action Required |
|---|---|---|---|
| Green | < 30% | Any | No action |
| Yellow | 30-60% | > 20 lines | Flag for manual review |
| Red | > 60% | > 10 lines | Block merge |
These thresholds come from analyzing 14,000 contractor submissions at my previous company. The 60%-10-line threshold caught 94% of our license violations while producing a 3.2% false-positive rate on benign boilerplate.
Refine these thresholds for your specific domain. A team writing embedded C firmware will have different similarity patterns than a team writing Python ETL pipelines. Run a retrospective on your last three contractor engagements. Adjust accordingly.
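Whatever numbers you land on, encode them as one pure function so that CI, reports, and retrospectives all agree on the classification. A minimal sketch using the default thresholds from the table above (the function name is mine, not a Codequiry API):

```python
def classify_match(similarity_pct: float, line_count: int) -> str:
    """Map one similarity match to a traffic-light level.

    Thresholds mirror the table: red blocks the merge, yellow flags
    for manual review, green needs no action. Tune per domain.
    """
    if similarity_pct > 60 and line_count > 10:
        return "red"
    if similarity_pct >= 30 and line_count > 20:
        return "yellow"
    return "green"
```

Note that a high-similarity match shorter than the context-line floor stays green; short canonical snippets are exactly the benign boilerplate the thresholds are designed to ignore.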
Step 3: Integrate Scanning Into Your Contractor Delivery Gate
The provenance check must happen before code review, not after. If you wait until the PR is open, you've already invested reviewer time. Worse, if you wait until the feature branch is merged, you've already accepted the risk.
Add a pre-merge gate that runs on every pull request from an external contributor. This is straightforward with GitHub Actions, GitLab CI, or Bitbucket Pipelines.
```yaml
# .github/workflows/provenance-check.yml
name: Source Code Provenance Check

on:
  pull_request:
    branches: [main, develop]
    types: [opened, synchronize]

jobs:
  provenance:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0  # Full history for accurate comparison

      - name: Run Codequiry Similarity Scan
        uses: codequiry/scan-action@v2
        with:
          api-key: ${{ secrets.CODEQUIRY_API_KEY }}
          baseline: s3://provenance-baseline/prod-2024-01/
          output-format: junit
          threshold-action: warn-yellow, block-red

      - name: Publish Results
        uses: codequiry/report-action@v1
        with:
          report-path: provenance-results.xml
          format: markdown
```
The pipeline does three things in sequence:
- Fingerprints every file in the pull request against your baseline corpus
- Checks each file against known open-source repositories (GitHub, GitLab, Bitbucket public repos) and community snippet sources (Stack Overflow, tutorial sites)
- Scores each file by originality and license risk
When the pipeline flags a red-level match, it blocks the merge and posts a detailed report directly into the PR thread. The contractor sees exactly which file triggered the flag, what it matched, and what license applies.
```text
# Example provenance report output for a flagged file
File: src/auth/middleware.js
Status: 🔴 BLOCKED

Match 1: CookieParser.js (Stack Overflow)
  Lines: 15-34
  Similarity: 82%
  License: CC-BY-SA 4.0 (attribution required)
  Action: Add attribution comment or rewrite

Match 2: RedisSession.js (MIT-licensed GitHub repo)
  Lines: 42-67
  Similarity: 71%
  License: MIT (compatible)
  Action: Add license header

Match 3: RateLimiter.js (unknown source)
  Lines: 89-112
  Similarity: 64%
  License: Unknown
  Action: Requires manual review
```
This report is not a gotcha. It's a conversation starter. The contractor may have valid reasons for the similarity — they contributed to the open-source project being matched, or they used a snippet under fair use. The provenance pipeline surfaces the fact, and the review conversation resolves it.
Step 4: Build a License Compatibility Matrix
A provenance pipeline that only flags similarity is half a solution. You also need to know whether the matched code's license is compatible with your project's license.
Build a license compatibility matrix specific to your organization. Here's a simplified version for a typical commercial SaaS product:
| Matched License | Your License | Compatible? | Requirements |
|---|---|---|---|
| MIT | Proprietary | Yes | Include copyright notice |
| Apache 2.0 | Proprietary | Yes | Include notice file |
| GPL 3.0 | Proprietary | No | May require rewrite or relicensing |
| CC-BY-SA | Proprietary | Conditional | Share-alike may apply to derivatives |
| BSD 3-Clause | Proprietary | Yes | Include copyright notice |
Automate this matrix in your pipeline. When the similarity engine identifies a match and its license, the pipeline should automatically determine whether the match is safe, requires attribution, or blocks the merge.
I recommend using the SPDX License List as your canonical reference. Most similarity engines, including Codequiry, return SPDX identifiers. Map those directly to your compatibility matrix.
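One straightforward way to automate the matrix is a plain lookup from SPDX identifier to a policy decision. The sketch below is illustrative: the `allow`/`review`/`block` action names are mine, and the deliberately small matrix mirrors the table above (using SPDX identifiers such as `GPL-3.0-only` for "GPL 3.0").

```python
# SPDX identifier -> (action, requirement) for a proprietary codebase.
# Extend this with your legal team; it is a policy, not a library.
LICENSE_POLICY = {
    "MIT":          ("allow", "Include copyright notice"),
    "Apache-2.0":   ("allow", "Include NOTICE file"),
    "BSD-3-Clause": ("allow", "Include copyright notice"),
    "GPL-3.0-only": ("block", "Copyleft: rewrite or seek relicensing"),
    "CC-BY-SA-4.0": ("review", "Share-alike may apply to derivatives"),
}

def license_decision(spdx_id: str) -> tuple[str, str]:
    """Return the (action, requirement) pair for a matched license.

    Unknown identifiers default to manual review, never silent approval.
    """
    return LICENSE_POLICY.get(
        spdx_id, ("review", "Unrecognized license: escalate to legal")
    )
```

The fail-closed default matters: a similarity engine will occasionally return a license your matrix has never seen, and "escalate" is the only safe answer.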
Step 5: Handle False Positives With an Exception Registry
Your provenance pipeline will produce false positives. A contractor who previously contributed to the open-source library being matched will trigger a red flag. A developer copying a five-line regex pattern from their own personal blog will show up as 90% similarity.
Build an exception registry that allows documented overrides. Each exception must include:
- The file path and matched segment
- A brief explanation of why the similarity is acceptable
- A reviewer's approval signature (or automated approval from a pre-authorized contributor list)
- An expiration date, if applicable
```yaml
# Example exception entry (stored in a version-controlled YAML file)
- file: "src/legacy/utils/dateHelpers.js"
  matched_segment: "lines 23-38"
  matched_source: "Stack Overflow answer 4826492"
  reason: "Canonical date-parsing regex. Contractor is the original author of the SO answer."
  approved_by: "[email protected]"
  approved_at: "2024-02-15"
  expires_at: null  # Permanent exception for confirmed authorship
Store these exceptions in a separate repository that your pipeline checks before blocking a merge. The flow becomes:
- Scan PR code
- Check against exception registry
- If exception exists, allow merge with note
- If no exception, block or warn based on threshold
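That flow fits in a few lines of Python. Everything here is illustrative: `gate` and `exception_applies` are hypothetical helper names, and a real implementation would also match the flagged line range, not just the file path.

```python
from datetime import date

def exception_applies(exceptions: list[dict], file: str, today: date) -> bool:
    """Check registry entries (loaded from the exceptions YAML) for a file.

    An exception matches on file path and must not be expired;
    expires_at of None means a permanent exception.
    """
    for entry in exceptions:
        if entry["file"] != file:
            continue
        expires = entry.get("expires_at")
        if expires is None or date.fromisoformat(expires) >= today:
            return True
    return False

def gate(level: str, file: str, exceptions: list[dict], today: date) -> str:
    """Scan level + exception registry -> merge decision."""
    if level == "red" and not exception_applies(exceptions, file, today):
        return "block"
    if level == "yellow" and not exception_applies(exceptions, file, today):
        return "warn"
    return "merge"
```

An expired exception simply stops matching, so a stale override fails back to blocking rather than quietly persisting.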
Without this exception mechanism, your pipeline will be overridden manually by frustrated managers who need the code merged yesterday. That undermines the entire system. Build the escape hatch into the process, not around it.
Step 6: Run Periodic Deep Scans on Merged Contractor Code
Pre-merge scans catch most issues, but they're not perfect. Contractors may refactor copied code to evade shallow similarity detection. Pre-merge scans also only see the diff, not the full file context.
Schedule a weekly deep scan that runs against all code merged from external contributors in the past seven days. This scan uses more computationally expensive techniques:
- Abstract syntax tree (AST) comparison rather than string-token comparison — catches structural copies even after renaming variables or reordering functions
- Full-file fingerprinting against known code repositories, not just diff segments
- Cross-language matching — a contractor might copy a Python algorithm and translate it to JavaScript
```python
# Example weekly deep scan configuration
import os

from codequiry import DeepScanner

scanner = DeepScanner(
    api_key=os.environ["CODEQUIRY_API_KEY"],
    scan_mode="ast_full",
    reference_corpus=[
        "baseline_fingerprint.json",
        "github_public_scan_2024-01",
        "stackoverflow_dump_2024-01",
    ],
    output_path="./weekly_scan_results",
    alert_threshold="yellow",
)

results = scanner.scan_directory("./merged_contractor_code")
for file_result in results:
    if file_result.status == "blocked":
        # notify_slack_channel is your own alerting helper
        notify_slack_channel(
            channel="#legal-review",
            message=f"🔴 Post-merge violation found in {file_result.path}",
        )
```
In my experience, weekly deep scans catch an additional 5-8% of non-original code that pre-merge scans miss. The most common catch is refactored copies where the contractor changed variable names but kept the structural logic identical.
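To see why AST comparison defeats variable renaming, here is a toy normalizer built on Python's standard-library `ast` module: it rewrites every function name, parameter, and identifier to a fixed placeholder, so two structurally identical functions produce the same signature. A real engine would also normalize literals and statement order, and would work across languages.

```python
import ast

class _Normalize(ast.NodeTransformer):
    """Rename all identifiers to placeholders so structurally
    identical code dumps to the same string after renames."""

    def visit_FunctionDef(self, node):
        node.name = "_f"
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = "_v"
        return node

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="_v", ctx=node.ctx), node)

def ast_signature(source: str) -> str:
    """Structural signature: parse, normalize identifiers, dump the tree."""
    return ast.dump(_Normalize().visit(ast.parse(source)))
```

With this, `def f(x): return x * 2` and `def g(total): return total * 2` yield identical signatures, while changing the operator to `+` does not; a string-token scanner would treat the rename as a different file.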
Step 7: Create a Delivery Artifact That Proves Provenance
The final step is producing a provenance artifact that travels with the contractor's code. This artifact becomes part of your legal documentation. If a license violation surfaces years later, you have a paper trail showing when you discovered it and what you did.
The artifact should contain:
- The baseline fingerprint hash at time of delivery
- All similarity reports from the pre-merge and post-merge scans
- Any exception registrations filed for that delivery
- A signed statement from the contractor affirming originality (or disclosing known dependencies)
```yaml
# Provenance artifact manifest (example)
manifest_version: 2.1
generated_at: 2024-03-01T14:32:00Z
delivery:
  contractor: "Acme Dev Services"
  project: "payment-module"
  commit_range: "abc1234..def5678"
  file_count: 87
baseline:
  hash: "sha256:4f1c2d3e..."
  generated_at: 2024-01-15
scan_results:
  pre_merge:
    total_flags: 12
    yellow: 10
    red: 2
    blocked_until_resolved: true
    resolution_log:
      - {file: "src/http/client.js", action: "rewrote matched segment"}
      - {file: "src/db/migrations/001.sql", action: "added license header"}
  post_merge_deep_scan:
    total_flags: 3
    yellow: 3
    red: 0
  exceptions_filed: []
contractor_attestation:
  signed_by: "[email protected]"
  signed_at: 2024-02-28
  statement: "All code in this delivery is original or properly attributed per the project license."
```
Store this artifact in an immutable object store (S3 with object lock, or a blockchain-based notary service). Link it to the contract and invoice records. If you ever need to prove due diligence in a copyright dispute, this artifact is your evidence.
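As a lightweight complement to object lock, you can make the manifest itself tamper-evident by hashing a canonical serialization and recording that hash alongside the contract records. A stdlib-only sketch (the function name is mine, not part of any tool):

```python
import hashlib
import json

def seal_manifest(manifest: dict) -> str:
    """Produce a stable content hash for a provenance manifest.

    Canonical JSON (sorted keys, fixed separators) gives the same hash
    for the same manifest regardless of key ordering, so the value can
    serve as a tamper-evidence check in contract and invoice systems.
    """
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode()).hexdigest()
```

If the stored manifest is ever edited after delivery, re-sealing it produces a different hash than the one filed with the contract, which is exactly the discrepancy an auditor would look for.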
Operating the Pipeline Without Killing Your Velocity
The most common objection I hear is that this pipeline slows down contractor onboarding and delivery acceptance. It doesn't have to.
In practice, the pre-merge scan adds about 90 seconds to the CI pipeline for a typical contractor PR of 20-30 files. The post-merge deep scan runs overnight and produces results before the next standup. The exception registry takes about 30 minutes to set up and only a few minutes per week to maintain.
The speed cost is negligible. The cost of one uncaught license violation, as I learned the hard way, is anything but.
Start with just the pre-merge scan and the baseline fingerprint. That alone catches the 80% case — the contractor who copies a file verbatim from a public repository. Add the deep scan and provenance artifact in month two. The exception registry comes naturally once you've hit your first false positive.
Your contractors, your legal team, and your future self will thank you.