# CommonCrawl-Ingestor

Ingest the Common Crawl index into ClickHouse.
```
########### XXXX ##########                    ___
####### XXXXXXXXXXXXXXXX #######   / ___|___  _ __ ___  _ __ ___   ___  _ __
#### XXXXXXXXXXXXXXXXXXXXXX ####  | |   / _ \| '_ ` _ \| '_ ` _ \ / _ \| '_ \
##### XXXXXXXXXXXXXXXXXXXX ####   | |__| (_) | | | | | | | | | | | (_) | | | |
##### XXXXXXXXXXXXXXXX #####       \____\___/|_| |_| |_|_| |_| |_|\___/|_| |_|
#### XXXXXXXXXXX ####              / ___|_ __ __ ___      _| |
XXXXXXXXXXX XXXXXXX X #           | |   | '__/ _` \ \ /\ / / |
XXXXXXXXXXX XX XXXXX#             | |___| |  | (_| |\ V  V /| |
XXXXXXXXXXXX XXXXXXX               \____|_|   \__,_| \_/\_/ |_|    _
XXXXXXXXXXX XXXXXXXXXXXX#          __   __/ ___| ___| |_
XXXXXXXXXXX XXXXXXXXXXXXXXX #      \ \ /\ / / |  _ / _ \ __|
# XXXXXXXXXX XXXXXXXXXXXXXXXXX#     \ V  V /| |_| |  __/ |_
## XXXXXXXXX XXXXXXXXXXXXXXX##       \_/\_/  \____|\___|\__|
### XXXXXXXX XXXXXXXXXXX ###
##### XXXXXX XXXXXXXX #####        circl.lu
######## XXX ##### XXXX #######
```
# CCWGET

## Introduction

CCWGET is the CLI client for the live Common Crawl index API hosted at CIRCL. It behaves like, and sticks to, the same basic commands and behaviour as a plain wget, but is extended with capabilities to use the data of the Common Crawl indexes.
## Features

CCWGET allows you to:

- Download a web page from the Common Crawl database
- List the pages available for a given FQDN or domain
- Search and enumerate domains
- Search for the SHA1 hash of a web page body
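The hash search uses the digest format Common Crawl stores in the `WARC-Payload-Digest` header (visible in the `-S` output below): an upper-case base32 encoding of the SHA1 of the response body, not the usual hex form. A minimal sketch of computing that digest for a page body you hold locally (the helper name is our own, not the tool's):

```python
import base64
import hashlib

def cc_payload_digest(body: bytes) -> str:
    """SHA1 of the payload, base32-encoded, matching the value stored
    in WARC-Payload-Digest (without the leading 'sha1:' prefix)."""
    return base64.b32encode(hashlib.sha1(body).digest()).decode("ascii")

# A digest is always 32 base32 characters for a 20-byte SHA1.
print(cc_payload_digest(b"<html>...</html>"))
```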
### To download a file

Simply request a URL as you would with wget. You may use `-O` to redirect the output to another file.

```bash
$ ./ccwget-local.py https://circl.lu/pub/tr-73/
INFO: Parsed URL -> scheme: https, host: circl.lu, path: /pub/tr-73/, query: None
INFO: Latest CCMAIN table: CCMAIN202547
INFO: Fetching WARC segment: https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-47/segments/1762439343283.41/warc/CC-MAIN-20251112191244-20251112221244-00771.warc.gz
INFO: Saved payload to index.html
```
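Behind the "Fetching WARC segment" step, each index hit points at one gzip member inside a large shared WARC file, so a single record can be retrieved with an HTTP Range request against `data.commoncrawl.org` using the offset and length recorded in the index. A rough sketch of that retrieval (function names are ours, not the tool's):

```python
import gzip
import urllib.request

CC_DATA = "https://data.commoncrawl.org/"

def byte_range(offset: int, length: int) -> str:
    """HTTP Range header value covering exactly one WARC record."""
    return f"bytes={offset}-{offset + length - 1}"

def fetch_record(warc_path: str, offset: int, length: int) -> bytes:
    """Fetch one gzip member of a WARC file and return the
    decompressed WARC record (headers + HTTP response)."""
    req = urllib.request.Request(
        CC_DATA + warc_path,
        headers={"Range": byte_range(offset, length)},
    )
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read())
```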
Note:

- If the resource is not present in the Common Crawl database, the following behaviour occurs:
- If no scheme is given, the tool assumes `https://`.
```bash
$ ./ccwget-local.py https://circl.lu/pub/tr-44/
INFO: Parsed URL -> scheme: https, host: circl.lu, path: /pub/tr-44/, query: None
INFO: Latest CCMAIN table: CCMAIN202547
No matching records found.
```
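The URL parsing shown in the INFO lines, including the `https://` default, can be approximated with the standard library. A sketch, not the tool's actual code:

```python
from urllib.parse import urlsplit

def parse_target(url: str):
    """Split a target URL into (scheme, host, path, query),
    defaulting to https:// when no scheme is given."""
    if "://" not in url:
        url = "https://" + url  # the assumed default scheme
    parts = urlsplit(url)
    return parts.scheme, parts.hostname, parts.path, parts.query or None

print(parse_target("circl.lu/pub/tr-44/"))
# → ('https', 'circl.lu', '/pub/tr-44/', None)
```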
You may display the Common Crawl WARC headers and the archived HTTP headers using the `-S` option.
```bash
$ ./ccwget-local.py https://circl.lu/pub/tr-73/ -S
INFO: Parsed URL -> scheme: https, host: circl.lu, path: /pub/tr-73/, query: None
INFO: Latest CCMAIN table: CCMAIN202547
INFO: Fetching WARC segment: https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-47/segments/1762439343283.41/warc/CC-MAIN-20251112191244-20251112221244-00771.warc.gz
WARC Headers: WARC/1.0
WARC-Type: response
WARC-Date: 2025-11-12T19:15:28Z
WARC-Record-ID: <urn:uuid:476092f6-c855-4975-b25d-4333f5b185a5>
Content-Length: 21909
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:8852e4f1-84c3-491b-a492-a7e0d0d40d23>
WARC-Concurrent-To: <urn:uuid:f406e731-1079-4ee3-8dcb-d1469869adbb>
WARC-IP-Address: 185.194.93.14
WARC-Target-URI: https://circl.lu/pub/tr-73/
WARC-Protocol: http/1.1
WARC-Protocol: tls/1.2
WARC-Cipher-Suite: TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
WARC-Payload-Digest: sha1:M7PTL7JTFQUVNSLHOH6WRJ2WQDPF35FE
WARC-Block-Digest: sha1:2JLKAWYHP2BVGOF4JIOCNM3XIUYTIDCH
WARC-Identified-Payload-Type: text/html
-----------------------------------
HTTP Headers: HTTP/1.1 200 OK
Date: Wed, 12 Nov 2025 19:15:28 GMT
Server: Apache
Strict-Transport-Security: max-age=15768000
Last-Modified: Thu, 09 Mar 2023 09:33:07 GMT
ETag: "5298-5f67455177423-gzip"
Accept-Ranges: bytes
Vary: Accept-Encoding
X-Crawler-Content-Encoding: gzip
Content-Security-Policy: default-src 'self' 'unsafe-inline' 'unsafe-eval' circl.lu www.circl.lu www.gstatic.com pandora.circl.lu cra.circl.lu; img-src 'self' 'unsafe-inline' 'unsafe-eval' data: circl.lu www.circl.lu www.gstatic.com pandora.circl.lu cra.circl.lu;
X-Content-Type-Options: nosniff
X-Frame-Options: DENY
X-XSS-Protection: 1; mode=block;
X-Crawler-Content-Length: 6911
Content-Length: 21144
Keep-Alive: timeout=15, max=99
Connection: Keep-Alive
Content-Type: text/html
-----------------------------------
INFO: Saved payload to index.html
```
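Both header blocks printed by `-S` share the simple `Name: value` wire format, and some keys (such as `WARC-Protocol` above) may repeat, so a parser has to keep a list of values per key. A small sketch of such a parser, not the tool's own code:

```python
def parse_headers(block: str) -> dict:
    """Parse a WARC or HTTP header block into a dict of value lists;
    lists because keys such as WARC-Protocol can appear more than once."""
    headers = {}
    for line in block.splitlines():
        if ":" not in line:
            continue  # skip the version line, e.g. "WARC/1.0"
        name, _, value = line.partition(":")
        headers.setdefault(name.strip(), []).append(value.strip())
    return headers

sample = "WARC/1.0\nWARC-Type: response\nWARC-Protocol: http/1.1\nWARC-Protocol: tls/1.2"
print(parse_headers(sample)["WARC-Protocol"])  # → ['http/1.1', 'tls/1.2']
```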
The `-i` option allows you to get information about the file location and hash without downloading it.
```bash
./ccwget-local.py https://circl.lu/pub/tr-73/ -i
INFO: Parsed URL -> scheme: https, host: circl.lu, path: /pub/tr-73/, query: None
INFO: Latest CCMAIN table: CCMAIN202547
------------------------------------------------------------
WARC file: crawl-data/CC-MAIN-2025-47/segments/1762439343283.41/warc/CC-MAIN-20251112191244-20251112221244-00771.warc.gz
WARC offset: 122010795
WARC length: 7731
Content digest: M7PTL7JTFQUVNSLHOH6WRJ2WQDPF35FE
Content languages: eng
Content MIME detected: text/html
```
### Get the files of an FQDN

You may list all the files available for a website using the `-l` (list) option.
```bash
$ ./ccwget-local.py -l www.circl.lu
INFO: Latest CCMAIN table: CCMAIN202547
https://www.circl.lu/advisory/CVE-2017-14337/
https://www.circl.lu/projects/bgpranking/?utm_source=cybersectools.com
https://www.circl.lu/pub/tr-93/
https://www.circl.lu/pub/tr-93/de/
https://www.circl.lu/pub/tr-93/fr/
https://www.circl.lu/pub/tr-94/
https://www.circl.lu/pub/tr-95/
https://www.circl.lu/pub/tr-96/
```
You may ask for details on each file.