Case study · Legal & public records

How we ingest, store and maintain 242,101 Swiss federal laws in 4 languages.

A production pipeline built and operated by B2B Connection that scrapes the entire body of Swiss federal legislation and court decisions across all 26 cantons in German, French, Italian and Romansh — then keeps it fresh on a four-times-a-week cron with automated validation, proxy rotation and PDF text extraction at ~12,000 documents per hour.

Total laws indexed: 242,101
PDFs stored in S3: 50,996
PDF coverage: 98.6%
Cantons covered: 26 / 26
Languages: 4 (DE / FR / IT / RM)
Refresh cadence: 4× / week
Storage backend: PostgreSQL + AWS S3
Avg. text-extraction throughput: ~12k PDFs / hr

The brief

The client, a Swiss legal-tech company, needed a single, authoritative datastore for every piece of federal legislation and every reasoned court decision published in Switzerland, structured for full-text search and queryable by canton, language, date and article number. The two upstream public sources (the federal legislation portal and the cantonal court-decision portal) publish complete catalogues, but in different formats, at different cadences, with different rate limits and with no shared identifier. Off-the-shelf scraping tools could pull a snapshot, but none could keep it accurate over a multi-year horizon without weekly engineering intervention.

B2B Connection was hired to design and operate the pipeline end-to-end — discovery, extraction, normalisation, storage, validation and delivery — as a managed service.

Architecture, step by step

Six stages, each one independently restartable, observable and idempotent. The same primitives appear in every large-scale dataset B2B Connection operates.

01
Source discovery & enumeration
Two upstream public sources: the federal legislation portal (the official Swiss Federal Chancellery legal portal) for legislation, and the cantonal court-decision portal for reasoned court decisions across every cantonal and federal court. We enumerate each catalogue via the published JSON manifest, stream-parse with ijson so the worker stays under 200 MB RAM even when the manifest is 2 GB+, and reconcile against the previous run to detect new, updated or withdrawn records.
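
The enumerate-and-reconcile step can be sketched roughly as follows. This is a minimal illustration, not the production code: the manifest is assumed to be a top-level JSON array of objects, the field names `id` and `etag` are hypothetical, and `iter_manifest` and `reconcile` are names invented for this example.

```python
def iter_manifest(path):
    """Stream (id, etag) pairs from a large manifest without loading it whole.

    Assumes a top-level JSON array of objects; "id" and "etag" are
    hypothetical field names.
    """
    import ijson  # third-party streaming parser, imported lazily

    with open(path, "rb") as fh:
        # "item" selects each element of the top-level array in turn.
        for record in ijson.items(fh, "item"):
            yield record["id"], record["etag"]


def reconcile(previous, current):
    """Compare the previous run's {id: etag} snapshot with the current pairs.

    Returns (new, updated, withdrawn) as sorted lists of ids.
    """
    new, updated, seen = [], [], set()
    for doc_id, etag in current:
        seen.add(doc_id)
        if doc_id not in previous:
            new.append(doc_id)
        elif previous[doc_id] != etag:
            updated.append(doc_id)
    # Anything in the previous snapshot that no longer appears was withdrawn.
    withdrawn = [doc_id for doc_id in previous if doc_id not in seen]
    return sorted(new), sorted(updated), sorted(withdrawn)
```

In use, `reconcile(previous_snapshot, iter_manifest("manifest.json"))` would keep only the id set and the previous snapshot in memory, never the parsed manifest itself.
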
02
Concurrent download with proxy rotation
Records are queued into a ThreadPoolExecutor (tuned per source — the federal legislation portal tolerates 32 concurrent connections, the cantonal court-decision portal only 8 before rate-limiting). Each request goes through a rotating residential proxy pool with per-request retry-after handling. A circuit-breaker pauses the worker if the source returns >5% 5xx in any 60-second window — preventing us from getting our IPs banned during incidents on their end.
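
A sliding-window circuit breaker of the kind described here might look like the sketch below. The class name, the `min_samples` guard and the injected `now` timestamps are assumptions made for illustration; only the 5% threshold and the 60-second window come from the text.

```python
from collections import deque


class ErrorRateBreaker:
    """Opens when more than `threshold` of recent responses were 5xx."""

    def __init__(self, threshold=0.05, window=60.0, min_samples=20):
        self.threshold = threshold
        self.window = window
        # Assumption: do not trip on a handful of requests.
        self.min_samples = min_samples
        self._events = deque()  # (timestamp, is_server_error)

    def record(self, status_code, now):
        """Register one response observed at time `now` (seconds)."""
        self._events.append((now, 500 <= status_code <= 599))
        self._evict(now)

    def _evict(self, now):
        # Drop events that have fallen out of the sliding window.
        while self._events and self._events[0][0] <= now - self.window:
            self._events.popleft()

    def is_open(self, now):
        """True when the worker should pause requests to this source."""
        self._evict(now)
        total = len(self._events)
        if total < self.min_samples:
            return False
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / total > self.threshold
```

A worker would call `record(response.status_code, time.monotonic())` after each request and check `is_open(time.monotonic())` before submitting the next one; passing the clock in explicitly keeps the class easy to test.
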
03
PDF text extraction (PyMuPDF / pypdfium2 fallback)
Every PDF is first opened with PyMuPDF for fast text extraction. The ~3% of documents that fail (scanned older laws, malformed or encrypted PDFs) fall back to pypdfium2, and scanned-only documents go through Tesseract OCR. Text is normalised to Unicode NFC and stored as UTF-8, language-tagged from the document metadata, and chunked by article for downstream search.
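
The fallback chain, normalisation and article chunking could be sketched as below. The extractors are passed in as plain callables so the example stays independent of any PDF library; the function names and the assumption that each article starts on a line beginning with "Art. <number>" are illustrative, not taken from the production code.

```python
import re
import unicodedata

# Assumption: articles start on their own line as "Art. 1", "Art. 2a", ...
ARTICLE_RE = re.compile(r"^Art\.\s*(\d+[a-z]*)", re.MULTILINE)


def normalise(text):
    """Return the text in Unicode NFC form."""
    return unicodedata.normalize("NFC", text)


def extract_with_fallback(pdf_bytes, extractors):
    """Try each (name, callable) in order; return (name, text) of the first success."""
    for name, extract in extractors:
        try:
            text = extract(pdf_bytes)
        except Exception:
            continue  # a real pipeline would log the failure reason here
        if text and text.strip():
            return name, normalise(text)
    return None, ""


def chunk_by_article(text):
    """Split text into (article_number, chunk) pairs; any preamble is dropped."""
    text = normalise(text)
    matches = list(ARTICLE_RE.finditer(text))
    chunks = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunks.append((match.group(1), text[match.start():end].strip()))
    return chunks
```

Here `extractors` would be an ordered list such as a PyMuPDF wrapper, then a pypdfium2 wrapper, then an OCR wrapper, each taking the raw bytes and returning a string.
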
04
Persistence to PostgreSQL (Neon) + S3
Raw PDFs go to S3 with content-addressable keys (sha256 of the file body), so re-uploads are free and dedup is automatic. Structured fields — title, canton_code, language, enactment date, version, full-text — land in PostgreSQL on Neon, with full-text search indexes for each language using the appropriate stemmer (german, french, italian, simple for Romansh).
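
As a rough illustration of the two ideas in this step, the sketch below derives a content-addressable object key from the file body and maps each language to a PostgreSQL text-search configuration. The key layout (`pdf/<first two hex chars>/<digest>.pdf`), the two-letter language codes and the function names are assumptions; the configuration names are the ones listed in the text.

```python
import hashlib

# Language code -> PostgreSQL text-search configuration named in the text.
TS_CONFIG = {"de": "german", "fr": "french", "it": "italian", "rm": "simple"}


def s3_key_for(body: bytes, prefix: str = "pdf") -> str:
    """Build an object key from the sha256 of the file body.

    Identical bytes always map to the same key, so a re-upload overwrites
    the same object instead of creating a duplicate.
    """
    digest = hashlib.sha256(body).hexdigest()
    # Hypothetical layout: shard by the first two hex characters.
    return f"{prefix}/{digest[:2]}/{digest}.pdf"


def ts_config_for(language: str) -> str:
    """Return the text-search configuration for a language code."""
    return TS_CONFIG[language.lower()]
```

Because the key depends only on the bytes, an uploader can check whether the key already exists and skip the transfer entirely when it does.
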
05
Cron schedule with idempotent runs
Monday/Thursday at 02:00 UTC: legislation sync from the federal legislation portal. Tuesday/Saturday at 02:00 UTC: court-decision sync from the cantonal court-decision portal. Every job is idempotent — a re-run produces no duplicate rows and only re-downloads files whose upstream ETag has changed. A failed run can be resumed by simply re-triggering it.
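
One way to get this kind of idempotency is an upsert keyed on the source identifier plus an ETag comparison, sketched below. SQLite from the standard library stands in for PostgreSQL so the example runs anywhere; the table, columns and function names are hypothetical.

```python
import sqlite3


def make_db():
    """Create an in-memory stand-in for the documents table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents ("
        " source_id TEXT PRIMARY KEY,"
        " etag TEXT NOT NULL,"
        " title TEXT NOT NULL)"
    )
    return conn


def sync(conn, records):
    """Upsert (source_id, etag, title) rows.

    Returns the ids whose file must be (re-)downloaded: rows that are new
    or whose upstream ETag differs from the stored one.
    """
    to_download = []
    for source_id, etag, title in records:
        row = conn.execute(
            "SELECT etag FROM documents WHERE source_id = ?", (source_id,)
        ).fetchone()
        if row is None or row[0] != etag:
            to_download.append(source_id)
        # The primary key guarantees a re-run cannot create duplicate rows.
        conn.execute(
            "INSERT INTO documents (source_id, etag, title) VALUES (?, ?, ?) "
            "ON CONFLICT(source_id) DO UPDATE SET "
            "etag = excluded.etag, title = excluded.title",
            (source_id, etag, title),
        )
    conn.commit()
    return to_download
```

Running `sync` twice with the same batch returns an empty list the second time, which is what makes re-triggering a failed run safe. On a host whose clock is set to UTC, the schedule above would correspond to crontab expressions along the lines of `0 2 * * 1,4` and `0 2 * * 2,6`.
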
06
Automated validation & drift detection
After every sync a validation job writes a report (validation_<timestamp>.json) that cross-checks counts in three places: the upstream manifest, the PostgreSQL row count, and the S3 object count. Discrepancies above the 1% threshold raise an alert. The current snapshot (Mar 2026) shows 50,996 of 51,724 expected PDFs in S3, leaving 728 outstanding, all logged with their failure reason for the next reconciliation pass.
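
The count cross-check can be illustrated with the sketch below. The text does not say exactly how the 1% threshold is applied, so this version assumes each observed count is compared against the upstream expectation; the function names and the shape of the `observed` dictionary are invented for the example.

```python
def coverage_pct(found, expected):
    """Percentage of expected items actually present, to one decimal place."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return round(100.0 * found / expected, 1)


def find_drift(expected, observed, threshold=0.01):
    """Return {name: relative_gap} for every count that deviates too far.

    `expected` is the upstream count; `observed` maps a store name
    (for example "postgres" or "s3") to the count found there.
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    drift = {}
    for name, count in observed.items():
        gap = abs(expected - count) / expected
        if gap > threshold:
            drift[name] = round(gap, 4)
    return drift
```

For the snapshot quoted above, `coverage_pct(50996, 51724)` gives the 98.6% headline figure. A non-empty result from `find_drift` is what would trigger the alert.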

The technology stack

Boring, well-understood, production-tested tools. Every layer chosen because it has survived years of real workload at scale, not because it was new.

Layer	Choice
Language	Python 3.11
HTTP client	requests + urllib3 retry adapter
Concurrency	ThreadPoolExecutor (I/O-bound downloads) + asyncio
JSON parsing	ijson (streaming, low memory)
PDF extraction	PyMuPDF, pypdfium2 fallback, Tesseract OCR
Database	PostgreSQL 16 on Neon (serverless, branch-per-env)
Object storage	AWS S3 (content-addressable, lifecycle to Glacier after 90d)
Proxy rotation	Bright Data residential pool + custom rotator
Scheduling	Linux cron + systemd timers, with PagerDuty on failure
Validation	Custom JSON-report differ vs. previous run
Observability	Structured JSON logs → Grafana Loki

What this pipeline says about B2B Connection's capability

This Swiss legal stack is one of dozens of long-running pipelines we operate. It happens to be the cleanest single illustration of the patterns we apply to every large dataset.

Scale: 200,000 → 10,000,000+ record pipelines

This Swiss legal stack is sized for ~250k records. The same architecture pattern — streaming parsers, idempotent workers, content-addressable storage — powers our Australian Google Maps dataset at 6.4 million venues and a US trades dataset at 2.4 million records. We design every pipeline so it can scale by an order of magnitude without a rewrite.

Long-term maintenance, not one-off scrapes

Anyone can run a scrape once. The hard part is keeping a dataset accurate for years. Every B2B Connection pipeline ships with a cron schedule, a validation job, and a recovery runbook. This Swiss legal pipeline has been running unattended for over a year — no manual intervention required for the last 9 months of operation.

Anti-bot & proxy infrastructure

Residential proxy pools, fingerprint rotation, header randomisation, intelligent backoff, and Cloudflare-friendly request shaping. We have working extractors for sources that block headless browsers outright — Realtor.com, TripAdvisor, Angi, Yelp, Booking.com, Indeed and a long tail of regional directories.

Multi-format ingestion (HTML, JSON, XML, PDF, DOCX)

Swiss federal laws are PDFs. Google Maps is JSON. Realtor.com is server-rendered HTML. Some EU registers are still XML feeds. Our pipelines handle all of it with the same primitives: stream the source, parse incrementally, dedupe by stable identifier, persist to a typed schema.

Database-first delivery (not just CSV dumps)

Most clients want a CSV — and they get one. But the underlying pipeline writes to PostgreSQL with proper indexes, foreign keys and full-text search. We can deliver directly into your Postgres, BigQuery, Snowflake, S3 or Azure Blob storage, or expose a read-only REST API if your team prefers to pull on demand.

Compliance-aware sourcing

Every document in this corpus is public-record government data — explicitly licensed for redistribution by the Swiss Federal Chancellery and the publishing cantonal courts. We only scrape sources where the legal basis is clear: public registers, government portals, and sites whose terms permit data extraction. Privacy law (GDPR, Australian Privacy Act 1988) is reviewed at scope-definition time, not after the fact.

Outcomes after 12 months in production

Zero manual interventions in the last 9 months of operation — the cron + validation pair has been self-sufficient.
98.6% PDF coverage against the upstream manifest — 50,996 of 51,724 expected files in S3, with the remaining 728 individually logged for the next reconciliation pass.
Sub-second full-text search across the full 242k-document corpus, in any of the four languages, via PostgreSQL's language-specific stemmers.
Predictable infrastructure cost — under USD $400 per month total (Neon compute, S3 storage, proxy traffic) at this scale.
Source changes absorbed without downtime — the upstream source changed its manifest schema twice in the last 12 months; both were patched inside 24 hours without a missed sync.

Large-scale scraping FAQ

How large a dataset can B2B Connection handle?

Our largest production pipeline today processes ~6.4 million records on a weekly refresh. This Swiss legal pipeline handles ~250,000 documents (laws + court decisions) with PDF storage in S3. We've designed prototypes for the 50-100 million record range and have no architectural blockers up to that scale — the limiting factor is usually the source's tolerance for traffic, not our infrastructure.

How do you keep a scraped dataset up to date over months and years?

Every pipeline ships with three components: a cron schedule (typically 2-7 runs per week), an idempotent sync job that only re-downloads what's changed, and an automated validation job that cross-checks counts against the upstream source. If the validation report shows drift above a 1% threshold, the on-call engineer is paged. This Swiss legal pipeline has run for over a year with no manual intervention in the last 9 months.

What happens when the source site changes its structure?

Two layers of defence. First, every extractor has a schema-validation step that fails loudly on the first run after a breaking change — we know within hours, not weeks. Second, we maintain selector tests against cached fixtures so we can rebuild an extractor without needing to scrape live during development. Typical turnaround for a source-side restructure is one to three business days.
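
A schema-validation step that fails loudly might be as small as the sketch below. The expected field names and types are hypothetical, as are `SchemaDriftError` and `validate_record`; the point is only that a renamed or retyped upstream field stops the run immediately instead of silently producing empty columns.

```python
# Hypothetical contract for one upstream record.
EXPECTED_FIELDS = {"id": str, "title": str, "language": str, "pdf_url": str}


class SchemaDriftError(Exception):
    """Raised when an upstream record no longer matches the expected shape."""


def validate_record(record, expected=EXPECTED_FIELDS):
    """Return the record unchanged, or raise SchemaDriftError."""
    missing = sorted(key for key in expected if key not in record)
    wrong_type = sorted(
        key
        for key, kind in expected.items()
        if key in record and not isinstance(record[key], kind)
    )
    if missing or wrong_type:
        raise SchemaDriftError(f"missing={missing} wrong_type={wrong_type}")
    return record
```

The same function can be run against cached fixtures in tests, so a rebuilt extractor is checked against the contract before it touches the live source.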

Can you scrape sites that use Cloudflare, hCaptcha or aggressive bot detection?

Yes, for sources where the data is publicly accessible to a human visitor. We use rotating residential proxy pools, realistic browser fingerprints, request shaping that mirrors human pacing, and selective use of full browser automation (Playwright) when JS-rendered content is unavoidable. We do not bypass paywalls, login-required content, or any source whose Terms of Service explicitly prohibit automated access.

How do you handle large PDF or document corpora?

PyMuPDF for fast text extraction (~12,000 PDFs per hour on a single 4-core worker), with pypdfium2 as a fallback for malformed files and Tesseract OCR for scanned-only documents. Storage is content-addressable in S3 (sha256-keyed) so re-uploads are free and dedup is automatic. We've processed corpora of 50,000+ PDFs for legal, planning-permit and tender-document clients.

Where does the data end up after extraction?

Choose your delivery: a CSV/XLSX bundle for one-off purchases, direct writes into your PostgreSQL / BigQuery / Snowflake / Redshift warehouse, S3 or Azure Blob drops in your bucket, a read-only REST API hosted by us, or a webhook stream of new/updated records. Most B2B Connection enterprise clients use the warehouse-write option so the data lands in their existing reporting stack.

What does B2B Connection guarantee about data quality?

Three guarantees: (1) every record is deduplicated against a stable identifier — ABN, sha256 of source URL, government register ID — so the same entity never appears twice; (2) every record carries a last_verified timestamp so downstream consumers can filter on freshness; (3) every refresh runs a validation job that compares counts against the upstream source and flags drift. If validation fails, the dataset is held back from delivery until reconciled.
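
The first two guarantees can be sketched in a few lines. This example assumes the stable identifier is the sha256 of the source URL (one of the options listed above) and that records are plain dictionaries with a hypothetical `source_url` field; `stable_id` and `dedupe` are illustrative names.

```python
import hashlib
from datetime import datetime, timezone


def stable_id(source_url):
    """Derive a stable identifier from the source URL."""
    return hashlib.sha256(source_url.strip().encode("utf-8")).hexdigest()


def dedupe(records, verified_at=None):
    """Collapse records sharing a source URL and stamp last_verified.

    When the same URL appears more than once, the later record wins.
    """
    verified_at = verified_at or datetime.now(timezone.utc)
    unique = {}
    for record in records:
        key = stable_id(record["source_url"])
        unique[key] = {
            **record,
            "id": key,
            "last_verified": verified_at.isoformat(),
        }
    return list(unique.values())
```

Downstream consumers can then filter on `last_verified` to exclude anything older than their freshness requirement.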

Is this Swiss legal dataset for sale on B2B Connection?

No — this is a confidential client project shown here as a capability reference. The dataset itself is owned by the client; B2B Connection built and operates the pipeline. If you need a similar pipeline for your own jurisdiction or vertical (public registers, court decisions, regulatory filings, planning permits, tenders) we can quote a custom build.

How long does a custom large-scale scraping project take?

Discovery and proof-of-concept (one extractor, ~1,000 sample records): 5-10 business days. Full production pipeline with cron, validation, storage and delivery: typically 4-8 weeks depending on source complexity and target scale. This Swiss legal pipeline took 6 weeks from initial scoping to first production run.

What does a project like this cost?

Custom pipelines start at AUD $8,000 for a single-source, single-format extractor with monthly delivery. Multi-source pipelines with PDF processing, proxy infrastructure and weekly refreshes (the shape of this Swiss legal project) typically land between AUD $25,000 and $60,000 for the build, with ongoing maintenance at AUD $400-1,500 per month depending on volume and refresh frequency.

Have a dataset this size — or larger?

B2B Connection builds and operates production scraping pipelines from ~200k records up to the tens of millions. Tell us the source, the target schema and the refresh cadence — we'll quote a build inside one business day.

Scope a custom pipeline

Or browse other recent projects.
