# CommonCrawl-Ingestor

Ingest the Common Crawl index into ClickHouse.
```
########### XXXX ##########                    ___
####### XXXXXXXXXXXXXXXX #######   / ___|___  _ __ ___  _ __ ___   ___  _ __
#### XXXXXXXXXXXXXXXXXXXXXX ####  | |   / _ \| '_ ` _ \| '_ ` _ \ / _ \| '_ \
##### XXXXXXXXXXXXXXXXXXXX ####   | |__| (_) | | | | | | | | | | | (_) | | | |
##### XXXXXXXXXXXXXXXX #####       \____\___/|_| |_| |_|_| |_| |_|\___/|_| |_|
#### XXXXXXXXXXX ####              / ___|_ __ __ ___      _| |
XXXXXXXXXXX XXXXXXX X #           | |   | '__/ _` \ \ /\ / / |
XXXXXXXXXXX XX XXXXX#             | |___| |  | (_| |\ V  V /| |
XXXXXXXXXXXX XXXXXXX               \____|_|   \__,_| \_/\_/ |_|    _
XXXXXXXXXXX XXXXXXXXXXXX#          __   __/ ___| ___| |_
XXXXXXXXXXX XXXXXXXXXXXXXXX #      \ \ /\ / / |  _ / _ \ __|
# XXXXXXXXXX XXXXXXXXXXXXXXXXX#     \ V  V /| |_| |  __/ |_
## XXXXXXXXX XXXXXXXXXXXXXXX##       \_/\_/  \____|\___|\__|
### XXXXXXXX XXXXXXXXXXX ###
##### XXXXXX XXXXXXXX #####        circl.lu
######## XXX ##### XXXX #######
```
# CCWGET

## Introduction

CCWGET is the CLI client for the live Common Crawl index API hosted at CIRCL. It behaves like, and sticks to, the same basic commands and behaviour as a plain wget, but is extended with capabilities to use the data of the Common Crawl indexes.
## Features

CCWGET allows you to:

- Download a web page from the Common Crawl database
- List the pages available for a given FQDN or domain
- Search and enumerate domains
- Search for the SHA1 hash of a web page body
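The hash search uses the digest format Common Crawl stores in the `WARC-Payload-Digest` header (visible in the `-S` output below): an upper-case base32 encoding of the SHA1 of the response body, not the usual hex form. A minimal sketch of computing that digest for a page body you hold locally (the helper name is our own, not the tool's):

```python
import base64
import hashlib

def cc_payload_digest(body: bytes) -> str:
    """SHA1 of the payload, base32-encoded, matching the value stored
    in WARC-Payload-Digest (without the leading 'sha1:' prefix)."""
    return base64.b32encode(hashlib.sha1(body).digest()).decode("ascii")

# A digest is always 32 base32 characters for a 20-byte SHA1.
print(cc_payload_digest(b"<html>...</html>"))
```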
### To download a file

Simply request a URL as you would with wget. You may use `-O` to redirect the output to another file.

```bash
$ ./ccwget-local.py https://circl.lu/pub/tr-73/
INFO: Parsed URL -> scheme: https, host: circl.lu, path: /pub/tr-73/, query: None
INFO: Latest CCMAIN table: CCMAIN202547
INFO: Fetching WARC segment: https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-47/segments/1762439343283.41/warc/CC-MAIN-20251112191244-20251112221244-00771.warc.gz
INFO: Saved payload to index.html
```
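Behind the "Fetching WARC segment" step, each index hit points at one gzip member inside a large shared WARC file, so a single record can be retrieved with an HTTP Range request against `data.commoncrawl.org` using the offset and length recorded in the index. A rough sketch of that retrieval (function names are ours, not the tool's):

```python
import gzip
import urllib.request

CC_DATA = "https://data.commoncrawl.org/"

def byte_range(offset: int, length: int) -> str:
    """HTTP Range header value covering exactly one WARC record."""
    return f"bytes={offset}-{offset + length - 1}"

def fetch_record(warc_path: str, offset: int, length: int) -> bytes:
    """Fetch one gzip member of a WARC file and return the
    decompressed WARC record (headers + HTTP response)."""
    req = urllib.request.Request(
        CC_DATA + warc_path,
        headers={"Range": byte_range(offset, length)},
    )
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read())
```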
Note:

- If the resource is not present in the Common Crawl database, the following behaviour occurs:
- If no scheme is given, the tool assumes `https://`.
```bash
$ ./ccwget-local.py https://circl.lu/pub/tr-44/
INFO: Parsed URL -> scheme: https, host: circl.lu, path: /pub/tr-44/, query: None
INFO: Latest CCMAIN table: CCMAIN202547
No matching records found.
```
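The URL parsing shown in the INFO lines, including the `https://` default, can be approximated with the standard library. A sketch, not the tool's actual code:

```python
from urllib.parse import urlsplit

def parse_target(url: str):
    """Split a target URL into (scheme, host, path, query),
    defaulting to https:// when no scheme is given."""
    if "://" not in url:
        url = "https://" + url  # the assumed default scheme
    parts = urlsplit(url)
    return parts.scheme, parts.hostname, parts.path, parts.query or None

print(parse_target("circl.lu/pub/tr-44/"))
# → ('https', 'circl.lu', '/pub/tr-44/', None)
```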
You may display the Common Crawl WARC headers and the archived HTTP headers using the `-S` option.
```bash
$ ./ccwget-local.py https://circl.lu/pub/tr-73/ -S
INFO: Parsed URL -> scheme: https, host: circl.lu, path: /pub/tr-73/, query: None
INFO: Latest CCMAIN table: CCMAIN202547
INFO: Fetching WARC segment: https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-47/segments/1762439343283.41/warc/CC-MAIN-20251112191244-20251112221244-00771.warc.gz
WARC Headers: WARC/1.0
WARC-Type: response
WARC-Date: 2025-11-12T19:15:28Z
WARC-Record-ID: <urn:uuid:476092f6-c855-4975-b25d-4333f5b185a5>
Content-Length: 21909
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:8852e4f1-84c3-491b-a492-a7e0d0d40d23>
WARC-Concurrent-To: <urn:uuid:f406e731-1079-4ee3-8dcb-d1469869adbb>
WARC-IP-Address: 185.194.93.14
WARC-Target-URI: https://circl.lu/pub/tr-73/
WARC-Protocol: http/1.1
WARC-Protocol: tls/1.2
WARC-Cipher-Suite: TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
WARC-Payload-Digest: sha1:M7PTL7JTFQUVNSLHOH6WRJ2WQDPF35FE
WARC-Block-Digest: sha1:2JLKAWYHP2BVGOF4JIOCNM3XIUYTIDCH
WARC-Identified-Payload-Type: text/html
-----------------------------------
HTTP Headers: HTTP/1.1 200 OK
Date: Wed, 12 Nov 2025 19:15:28 GMT
Server: Apache
Strict-Transport-Security: max-age=15768000
Last-Modified: Thu, 09 Mar 2023 09:33:07 GMT
ETag: "5298-5f67455177423-gzip"
Accept-Ranges: bytes
Vary: Accept-Encoding
X-Crawler-Content-Encoding: gzip
Content-Security-Policy: default-src 'self' 'unsafe-inline' 'unsafe-eval' circl.lu www.circl.lu www.gstatic.com pandora.circl.lu cra.circl.lu; img-src 'self' 'unsafe-inline' 'unsafe-eval' data: circl.lu www.circl.lu www.gstatic.com pandora.circl.lu cra.circl.lu;
X-Content-Type-Options: nosniff
X-Frame-Options: DENY
X-XSS-Protection: 1; mode=block;
X-Crawler-Content-Length: 6911
Content-Length: 21144
Keep-Alive: timeout=15, max=99
Connection: Keep-Alive
Content-Type: text/html
-----------------------------------
INFO: Saved payload to index.html
```
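Both header blocks printed by `-S` share the simple `Name: value` wire format, and some keys (such as `WARC-Protocol` above) may repeat, so a parser has to keep a list of values per key. A small sketch of such a parser, not the tool's own code:

```python
def parse_headers(block: str) -> dict:
    """Parse a WARC or HTTP header block into a dict of value lists;
    lists because keys such as WARC-Protocol can appear more than once."""
    headers = {}
    for line in block.splitlines():
        if ":" not in line:
            continue  # skip the version line, e.g. "WARC/1.0"
        name, _, value = line.partition(":")
        headers.setdefault(name.strip(), []).append(value.strip())
    return headers

sample = "WARC/1.0\nWARC-Type: response\nWARC-Protocol: http/1.1\nWARC-Protocol: tls/1.2"
print(parse_headers(sample)["WARC-Protocol"])  # → ['http/1.1', 'tls/1.2']
```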
The `-i` option allows you to get information about the file location and hash without downloading it.
```bash
./ccwget-local.py https://circl.lu/pub/tr-73/ -i
INFO: Parsed URL -> scheme: https, host: circl.lu, path: /pub/tr-73/, query: None
INFO: Latest CCMAIN table: CCMAIN202547
------------------------------------------------------------
WARC file: crawl-data/CC-MAIN-2025-47/segments/1762439343283.41/warc/CC-MAIN-20251112191244-20251112221244-00771.warc.gz
WARC offset: 122010795
WARC length: 7731
Content digest: M7PTL7JTFQUVNSLHOH6WRJ2WQDPF35FE
Content languages: eng
Content MIME detected: text/html
```
### Get the files of an FQDN

You may list all the files available for a website using the `-l` (list) option.
```bash
$ ./ccwget-local.py -l www.circl.lu
INFO: Latest CCMAIN table: CCMAIN202547
https://www.circl.lu/advisory/CVE-2017-14337/
https://www.circl.lu/projects/bgpranking/?utm_source=cybersectools.com
https://www.circl.lu/pub/tr-93/
https://www.circl.lu/pub/tr-93/de/
https://www.circl.lu/pub/tr-93/fr/
https://www.circl.lu/pub/tr-94/
https://www.circl.lu/pub/tr-95/
https://www.circl.lu/pub/tr-96/
```
You may ask for details on each file.