ail-framework/bin/modules/Onion.py

#!/usr/bin/env python3
# -*-coding:UTF-8 -*
"""
The Onion Module
============================

This module extracts URLs from an item and keeps only those that are Tor
related (.onion). Valid onion domains are sent to the crawler queue.

Requirements
------------

* Requires running Redis instances (Redis).

"""
import os
import sys
import re

sys.path.append(os.environ['AIL_BIN'])
##################################
# Import Project packages
##################################
from modules.abstract_module import AbstractModule
from lib.ConfigLoader import ConfigLoader
from lib.objects.Items import Item
from lib import crawlers

class Onion(AbstractModule):
    """Detect .onion URLs in items, queue their domains for crawling and tag the item."""

    def __init__(self):
        super().__init__()

        config_loader = ConfigLoader()
        self.r_cache = config_loader.get_redis_conn("Redis_Cache")

        self.pending_seconds = config_loader.get_config_int("Onion", "max_execution_time")
        # regex timeout
        self.regex_timeout = 30

        self.faup = crawlers.get_faup()

        # activate_crawler = p.config.get("Crawler", "activate_crawler")


        self.onion_regex = r"((http|https|ftp)?(?:\://)?([a-zA-Z0-9\.\-]+(\:[a-zA-Z0-9\.&%\$\-]+)*@)*((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[0-9])|localhost|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-9\-]+\.onion)(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\?\'\\\+&%\$#\=~_\-]+))*)"
        # self.i2p_regex = r"((http|https|ftp)?(?:\://)?([a-zA-Z0-9\.\-]+(\:[a-zA-Z0-9\.&%\$\-]+)*@)*((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[0-9])|localhost|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-9\-]+\.i2p)(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\?\'\\\+&%\$#\=~_\-]+))*)"
        # Compile once at start-up so an invalid pattern fails fast; the result is discarded
        re.compile(self.onion_regex)
        # re.compile(self.i2p_regex)

        self.redis_logger.info(f"Module: {self.module_name} Launched")

        # TEMP var: SAVE I2P Domain (future I2P crawler)
        # self.save_i2p = config_loader.get_config_boolean("Onion", "save_i2p")

    def extract(self, obj_id, content, tag):
        extracted = []
        onions = self.regex_finditer(self.onion_regex, obj_id, content)
        for onion in onions:
            start, end, value = onion
            url_unpack = crawlers.unpack_url(value)
            domain = url_unpack['domain']
            if crawlers.is_valid_onion_domain(domain):
                extracted.append([start, end, value, f'tag:{tag}'])
        return extracted

    def compute(self, message):
        onion_urls = []
        domains = []

        item_id, score = message.split()
        item = Item(item_id)
        item_content = item.get_content()

        # max execution time on regex
        res = self.regex_findall(self.onion_regex, item.get_id(), item_content)
        for x in res:
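            # Each result is expected to be the string form of a match tuple:
            # strip the outer "('" and "')", split on the element separators
            # and keep the first group, which is the full URL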
            x = x[2:-2].replace(" '", "").split("',")
            url = x[0]
            print(url)

            # TODO Crawl subdomain
            url_unpack = crawlers.unpack_url(url)
            domain = url_unpack['domain']
            if crawlers.is_valid_onion_domain(domain):
                domains.append(domain)
                onion_urls.append(url)

        if onion_urls:
            if crawlers.is_crawler_activated():
                for domain in domains:  # TODO LOAD DEFAULT SCREENSHOT + HAR
                    task_uuid = crawlers.create_task(domain, parent=item.get_id(), priority=0)
                    if task_uuid:
                        print(f'{domain} added to crawler queue: {task_uuid}')
            else:
                to_print = f'Onion;{item.get_source()};{item.get_date()};{item.get_basename()};'
                print(f'{to_print}Detected {len(domains)} .onion(s);{item.get_id()}')
                self.redis_logger.warning(f'{to_print}Detected {len(domains)} .onion(s);{item.get_id()}')

            # TAG Item
            msg = f'infoleak:automatic-detection="onion";{item.get_id()}'
            self.send_message_to_queue(msg, 'Tags')


if __name__ == "__main__":
    module = Onion()
    # module.compute('submitted/2022/10/10/submitted_705d1d92-7e9a-4a44-8c21-ccd167bfb7db.gz 9')
    module.run()