diff --git a/HOWTO.md b/HOWTO.md index 3978408f..855f3d54 100644 --- a/HOWTO.md +++ b/HOWTO.md @@ -89,76 +89,34 @@ Also, you can quickly stop or start modules by clicking on the ```` or ``` Finally, you can quit this program by pressing either ```` or ````. -Terms frequency usage ---------------------- - -In AIL, you can track terms, set of terms and even regexes without creating a dedicated module. To do so, go to the tab `Terms Frequency` in the web interface. -- You can track a term by simply putting it in the box. -- You can track a set of terms by simply putting terms in an array surrounded by the '\' character. You can also set a custom threshold regarding the number of terms that must match to trigger the detection. For example, if you want to track the terms _term1_ and _term2_ at the same time, you can use the following rule: `\[term1, term2, [100]]\` -- You can track regexes as easily as tracking a term. You just have to put your regex in the box surrounded by the '/' character. For example, if you want to track the regex matching all email address having the domain _domain.net_, you can use the following aggressive rule: `/*.domain.net/`. - Crawler --------------------- In AIL, you can crawl Tor hidden services. Don't forget to review the proxy configuration of your Tor client and especially if you enabled the SOCKS5 proxy and binding on the appropriate IP address reachable via the dockers where Splash runs. -There are two types of installation. You can install a *local* or a *remote* Splash server. -``(Splash host) = the server running the splash service`` -``(AIL host) = the server running AIL`` - -### Installation/Configuration - -1. *(Splash host)* Launch ``crawler_hidden_services_install.sh`` to install all requirements (type ``y`` if a localhost splash server is used or use the ``-y`` option) - -2. *(Splash host)* To install and setup your tor proxy: - - Install the tor proxy: ``sudo apt-get install tor -y`` - (Not required if ``Splash host == AIL host`` - The tor proxy is installed by default in AIL) - - (Warning: Some v3 onion address are not resolved with the tor proxy provided via apt get. Use the tor proxy provided by [The torproject](https://2019.www.torproject.org/docs/debian) to solve this issue) - - Allow Tor to bind to any interface or to the docker interface (by default binds to 127.0.0.1 only) in ``/etc/tor/torrc`` - ``SOCKSPort 0.0.0.0:9050`` or - ``SOCKSPort 172.17.0.1:9050`` - - Add the following line ``SOCKSPolicy accept 172.17.0.0/16`` in ``/etc/tor/torrc`` - (for a linux docker, the localhost IP is *172.17.0.1*; Should be adapted for other platform) - - Restart the tor proxy: ``sudo service tor restart`` - -3. *(AIL host)* Edit the ``/configs/core.cfg`` file: - - In the crawler section, set ``activate_crawler`` to ``True`` - - Change the IP address of Splash servers if needed (remote only) - - Set ``splash_onion_port`` according to your Splash servers port numbers that will be used. - those ports numbers should be described as a single port (ex: 8050) or a port range (ex: 8050-8052 for 8050,8051,8052 ports). 
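The crawler introduction above still assumes a Tor SOCKS5 proxy that is reachable from the Docker network where Splash runs. As a reminder (values taken from the former local-setup notes above; adjust the interface and subnet to your own environment), the relevant ``/etc/tor/torrc`` lines were:

```
SOCKSPort 172.17.0.1:9050
SOCKSPolicy accept 172.17.0.0/16
```

followed by ``sudo service tor restart``.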
+### Installation -### Starting the scripts +[Install AIL-Splash-Manager](https://github.com/ail-project/ail-splash-manager) -- *(Splash host)* Launch all Splash servers with: -```sudo ./bin/torcrawler/launch_splash_crawler.sh -f -p -n ``` -With ```` and ```` matching those specified at ``splash_onion_port`` in the configuration file of point 3 (``/configs/core.cfg``) +### Configuration -All Splash dockers are launched inside the ``Docker_Splash`` screen. You can use ``sudo screen -r Docker_Splash`` to connect to the screen session and check all Splash servers status. - -- (AIL host) launch all AIL crawler scripts using: -```./bin/LAUNCH.sh -c``` +1. Search the Splash-Manager API key. This API key is generated when you launch the manager for the first time. +(located in your Splash Manager directory ``ail-splash-manager/token_admin.txt``) -### TL;DR - Local setup +2. Splash Manager URL and API Key: +In the webinterface, go to ``Crawlers>Settings`` and click on the Edit button +![Splash Manager Config](./doc/screenshots/splash_manager_config_edit_1.png?raw=true "AIL framework Splash Manager Config") -#### Installation -- ```crawler_hidden_services_install.sh -y``` -- Add the following line in ``SOCKSPolicy accept 172.17.0.0/16`` in ``/etc/tor/torrc`` -- ```sudo service tor restart``` -- set activate_crawler to True in ``/configs/core.cfg`` -#### Start -- ```sudo ./bin/torcrawler/launch_splash_crawler.sh -f $AIL_HOME/configs/docker/splash_onion/etc/splash/proxy-profiles/ -p 8050 -n 1``` +![Splash Manager Config](./doc/screenshots/splash_manager_config_edit_2.png?raw=true "AIL framework Splash Manager Config") -If AIL framework is not started, it's required to start it before the crawler service: +3. Launch AIL Crawlers: +Choose the number of crawlers you want to launch +![Splash Manager Nb Crawlers Config](./doc/screenshots/splash_manager_nb_crawlers_1.png?raw=true "AIL framework Nb Crawlers Config") +![Splash Manager Nb Crawlers Config](./doc/screenshots/splash_manager_nb_crawlers_2.png?raw=true "AIL framework Nb Crawlers Config") -- ```./bin/LAUNCH.sh -l``` - -Then starting the crawler service (if you follow the procedure above) - -- ```./bin/LAUNCH.sh -c``` #### Old updates diff --git a/OVERVIEW.md b/OVERVIEW.md index 316942cb..5790acd9 100644 --- a/OVERVIEW.md +++ b/OVERVIEW.md @@ -420,6 +420,33 @@ Supported cryptocurrency: } ``` +### Splash containers and proxies: +| SET - Key | Value | +| ------ | ------ | +| all_proxy | **proxy name** | +| all_splash | **splash name** | + +| HSET - Key | Field | Value | +| ------ | ------ | ------ | +| proxy:metadata:**proxy name** | host | **host** | +| proxy:metadata:**proxy name** | port | **port** | +| proxy:metadata:**proxy name** | type | **type** | +| proxy:metadata:**proxy name** | crawler_type | **crawler_type** | +| proxy:metadata:**proxy name** | description | **proxy description** | +| | | | +| splash:metadata:**splash name** | description | **splash description** | +| splash:metadata:**splash name** | crawler_type | **crawler_type** | +| splash:metadata:**splash name** | proxy | **splash proxy (None if null)** | + +| SET - Key | Value | +| ------ | ------ | +| splash:url:**container name** | **splash url** | +| proxy:splash:**proxy name** | **container name** | + +| Key | Value | +| ------ | ------ | +| splash:map:url:name:**splash url** | **container name** | + ##### CRAWLER QUEUES: | SET - Key | Value | | ------ | ------ | diff --git a/bin/Crawler.py b/bin/Crawler.py index 4d745aad..b12d0f11 100755 --- a/bin/Crawler.py +++ 
b/bin/Crawler.py @@ -19,6 +19,9 @@ sys.path.append(os.environ['AIL_BIN']) from Helper import Process from pubsublogger import publisher +sys.path.append(os.path.join(os.environ['AIL_BIN'], 'lib')) +import crawlers + # ======== FUNCTIONS ======== def load_blacklist(service_type): @@ -117,43 +120,6 @@ def unpack_url(url): return to_crawl -# get url, paste and service_type to crawl -def get_elem_to_crawl(rotation_mode): - message = None - domain_service_type = None - - #load_priority_queue - for service_type in rotation_mode: - message = redis_crawler.spop('{}_crawler_priority_queue'.format(service_type)) - if message is not None: - domain_service_type = service_type - break - #load_discovery_queue - if message is None: - for service_type in rotation_mode: - message = redis_crawler.spop('{}_crawler_discovery_queue'.format(service_type)) - if message is not None: - domain_service_type = service_type - break - #load_normal_queue - if message is None: - for service_type in rotation_mode: - message = redis_crawler.spop('{}_crawler_queue'.format(service_type)) - if message is not None: - domain_service_type = service_type - break - - if message: - splitted = message.rsplit(';', 1) - if len(splitted) == 2: - url, paste = splitted - if paste: - paste = paste.replace(PASTES_FOLDER+'/', '') - - message = {'url': url, 'paste': paste, 'type_service': domain_service_type, 'original_message': message} - - return message - def get_crawler_config(redis_server, mode, service_type, domain, url=None): crawler_options = {} if mode=='auto': @@ -175,14 +141,17 @@ def get_crawler_config(redis_server, mode, service_type, domain, url=None): redis_server.delete('crawler_config:{}:{}:{}'.format(mode, service_type, domain)) return crawler_options -def load_crawler_config(service_type, domain, paste, url, date): +def load_crawler_config(queue_type, service_type, domain, paste, url, date): crawler_config = {} - crawler_config['splash_url'] = splash_url + crawler_config['splash_url'] = f'http://{splash_url}' crawler_config['item'] = paste crawler_config['service_type'] = service_type crawler_config['domain'] = domain crawler_config['date'] = date + if queue_type and queue_type != 'tor': + service_type = queue_type + # Auto and Manual Crawling # Auto ################################################# create new entry, next crawling => here or when ended ? 
if paste == 'auto': @@ -224,26 +193,29 @@ def crawl_onion(url, domain, port, type_service, message, crawler_config): crawler_config['port'] = port print('Launching Crawler: {}'.format(url)) - r_cache.hset('metadata_crawler:{}'.format(splash_port), 'crawling_domain', domain) - r_cache.hset('metadata_crawler:{}'.format(splash_port), 'started_time', datetime.datetime.now().strftime("%Y/%m/%d - %H:%M.%S")) + r_cache.hset('metadata_crawler:{}'.format(splash_url), 'crawling_domain', domain) + r_cache.hset('metadata_crawler:{}'.format(splash_url), 'started_time', datetime.datetime.now().strftime("%Y/%m/%d - %H:%M.%S")) retry = True nb_retry = 0 while retry: try: - r = requests.get(splash_url , timeout=30.0) + r = requests.get(f'http://{splash_url}' , timeout=30.0) retry = False except Exception: # TODO: relaunch docker or send error message nb_retry += 1 + if nb_retry == 2: + crawlers.restart_splash_docker(splash_url, splash_name) + if nb_retry == 6: on_error_send_message_back_in_queue(type_service, domain, message) publisher.error('{} SPASH DOWN'.format(splash_url)) print('--------------------------------------') print(' \033[91m DOCKER SPLASH DOWN\033[0m') print(' {} DOWN'.format(splash_url)) - r_cache.hset('metadata_crawler:{}'.format(splash_port), 'status', 'SPLASH DOWN') + r_cache.hset('metadata_crawler:{}'.format(splash_url), 'status', 'SPLASH DOWN') nb_retry == 0 print(' \033[91m DOCKER SPLASH NOT AVAILABLE\033[0m') @@ -251,7 +223,7 @@ def crawl_onion(url, domain, port, type_service, message, crawler_config): time.sleep(10) if r.status_code == 200: - r_cache.hset('metadata_crawler:{}'.format(splash_port), 'status', 'Crawling') + r_cache.hset('metadata_crawler:{}'.format(splash_url), 'status', 'Crawling') # save config in cash UUID = str(uuid.uuid4()) r_cache.set('crawler_request:{}'.format(UUID), json.dumps(crawler_config)) @@ -273,8 +245,10 @@ def crawl_onion(url, domain, port, type_service, message, crawler_config): print('') print(' PROXY DOWN OR BAD CONFIGURATION\033[0m'.format(splash_url)) print('------------------------------------------------------------------------') - r_cache.hset('metadata_crawler:{}'.format(splash_port), 'status', 'Error') + r_cache.hset('metadata_crawler:{}'.format(splash_url), 'status', 'Error') exit(-2) + else: + crawlers.update_splash_manager_connection_status(True) else: print(process.stdout.read()) exit(-1) @@ -283,7 +257,7 @@ def crawl_onion(url, domain, port, type_service, message, crawler_config): print('--------------------------------------') print(' \033[91m DOCKER SPLASH DOWN\033[0m') print(' {} DOWN'.format(splash_url)) - r_cache.hset('metadata_crawler:{}'.format(splash_port), 'status', 'Crawling') + r_cache.hset('metadata_crawler:{}'.format(splash_url), 'status', 'Crawling') exit(1) # check external links (full_crawl) @@ -305,13 +279,27 @@ def search_potential_source_domain(type_service, domain): if __name__ == '__main__': if len(sys.argv) != 2: - print('usage:', 'Crawler.py', 'splash_port') + print('usage:', 'Crawler.py', 'splash_url') exit(1) ################################################## - #mode = sys.argv[1] - splash_port = sys.argv[1] + splash_url = sys.argv[1] + + splash_name = crawlers.get_splash_name_by_url(splash_url) + proxy_name = crawlers.get_splash_proxy(splash_name) + crawler_type = crawlers.get_splash_crawler_type(splash_name) + + print(f'SPLASH Name: {splash_name}') + print(f'Proxy Name: {proxy_name}') + print(f'Crawler Type: {crawler_type}') + + #time.sleep(10) + #sys.exit(0) + + #rotation_mode = deque(['onion', 'regular']) + 
all_crawler_queues = crawlers.get_crawler_queue_types_by_splash_name(splash_name) + rotation_mode = deque(all_crawler_queues) + print(rotation_mode) - rotation_mode = deque(['onion', 'regular']) default_proto_map = {'http': 80, 'https': 443} ######################################################## add ftp ??? @@ -323,7 +311,6 @@ if __name__ == '__main__': # Setup the I/O queues p = Process(config_section) - splash_url = '{}:{}'.format( p.config.get("Crawler", "splash_url"), splash_port) print('splash url: {}'.format(splash_url)) PASTES_FOLDER = os.path.join(os.environ['AIL_HOME'], p.config.get("Directories", "pastes")) @@ -346,7 +333,7 @@ if __name__ == '__main__': db=p.config.getint("ARDB_Onion", "db"), decode_responses=True) - faup = Faup() + faup = crawlers.get_faup() # get HAR files default_crawler_har = p.config.getboolean("Crawler", "default_crawler_har") @@ -372,9 +359,9 @@ if __name__ == '__main__': 'user_agent': p.config.get("Crawler", "default_crawler_user_agent")} # Track launched crawler - r_cache.sadd('all_crawler', splash_port) - r_cache.hset('metadata_crawler:{}'.format(splash_port), 'status', 'Waiting') - r_cache.hset('metadata_crawler:{}'.format(splash_port), 'started_time', datetime.datetime.now().strftime("%Y/%m/%d - %H:%M.%S")) + r_cache.sadd('all_splash_crawlers', splash_url) + r_cache.hset('metadata_crawler:{}'.format(splash_url), 'status', 'Waiting') + r_cache.hset('metadata_crawler:{}'.format(splash_url), 'started_time', datetime.datetime.now().strftime("%Y/%m/%d - %H:%M.%S")) # update hardcoded blacklist load_blacklist('onion') @@ -385,7 +372,7 @@ if __name__ == '__main__': update_auto_crawler() rotation_mode.rotate() - to_crawl = get_elem_to_crawl(rotation_mode) + to_crawl = crawlers.get_elem_to_crawl_by_queue_type(rotation_mode) if to_crawl: url_data = unpack_url(to_crawl['url']) # remove domain from queue @@ -408,9 +395,9 @@ if __name__ == '__main__': 'epoch': int(time.time())} # Update crawler status type - r_cache.sadd('{}_crawlers'.format(to_crawl['type_service']), splash_port) + r_cache.hset('metadata_crawler:{}'.format(splash_url), 'type', to_crawl['type_service']) - crawler_config = load_crawler_config(to_crawl['type_service'], url_data['domain'], to_crawl['paste'], to_crawl['url'], date) + crawler_config = load_crawler_config(to_crawl['queue_type'], to_crawl['type_service'], url_data['domain'], to_crawl['paste'], to_crawl['url'], date) # check if default crawler if not crawler_config['requested']: # Auto crawl only if service not up this month @@ -456,11 +443,11 @@ if __name__ == '__main__': redis_crawler.ltrim('last_{}'.format(to_crawl['type_service']), 0, 15) #update crawler status - r_cache.hset('metadata_crawler:{}'.format(splash_port), 'status', 'Waiting') - r_cache.hdel('metadata_crawler:{}'.format(splash_port), 'crawling_domain') + r_cache.hset('metadata_crawler:{}'.format(splash_url), 'status', 'Waiting') + r_cache.hdel('metadata_crawler:{}'.format(splash_url), 'crawling_domain') # Update crawler status type - r_cache.srem('{}_crawlers'.format(to_crawl['type_service']), splash_port) + r_cache.hdel('metadata_crawler:{}'.format(splash_url), 'type', to_crawl['type_service']) # add next auto Crawling in queue: if to_crawl['paste'] == 'auto': diff --git a/bin/LAUNCH.sh b/bin/LAUNCH.sh index c4e4a538..e231966c 100755 --- a/bin/LAUNCH.sh +++ b/bin/LAUNCH.sh @@ -150,6 +150,8 @@ function launching_scripts { # LAUNCH CORE MODULE screen -S "Script_AIL" -X screen -t "JSON_importer" bash -c "cd ${AIL_BIN}/import; ${ENV_PY} ./JSON_importer.py; read x" sleep 0.1 
+ screen -S "Script_AIL" -X screen -t "Crawler_manager" bash -c "cd ${AIL_BIN}/core; ${ENV_PY} ./Crawler_manager.py; read x" + sleep 0.1 screen -S "Script_AIL" -X screen -t "ModuleInformation" bash -c "cd ${AIL_BIN}; ${ENV_PY} ./ModulesInformationV2.py -k 0 -c 1; read x" @@ -198,8 +200,8 @@ function launching_scripts { sleep 0.1 screen -S "Script_AIL" -X screen -t "Tools" bash -c "cd ${AIL_BIN}; ${ENV_PY} ./Tools.py; read x" sleep 0.1 - screen -S "Script_AIL" -X screen -t "Phone" bash -c "cd ${AIL_BIN}; ${ENV_PY} ./Phone.py; read x" - sleep 0.1 + #screen -S "Script_AIL" -X screen -t "Phone" bash -c "cd ${AIL_BIN}; ${ENV_PY} ./Phone.py; read x" + #sleep 0.1 #screen -S "Script_AIL" -X screen -t "Release" bash -c "cd ${AIL_BIN}; ${ENV_PY} ./Release.py; read x" #sleep 0.1 screen -S "Script_AIL" -X screen -t "Cve" bash -c "cd ${AIL_BIN}; ${ENV_PY} ./Cve.py; read x" diff --git a/bin/core/Crawler_manager.py b/bin/core/Crawler_manager.py new file mode 100755 index 00000000..3a95e706 --- /dev/null +++ b/bin/core/Crawler_manager.py @@ -0,0 +1,66 @@ +#!/usr/bin/env python3 +# -*-coding:UTF-8 -* + +import os +import sys +import time + +sys.path.append(os.path.join(os.environ['AIL_BIN'], 'lib')) +import ConfigLoader +import crawlers + +config_loader = ConfigLoader.ConfigLoader() +r_serv_metadata = config_loader.get_redis_conn("ARDB_Metadata") +config_loader = None + +# # TODO: lauch me in core screen +# # TODO: check if already launched in tor screen + +# # TODO: handle mutltiple splash_manager +if __name__ == '__main__': + + is_manager_connected = crawlers.ping_splash_manager() + if not is_manager_connected: + print('Error, Can\'t connect to Splash manager') + session_uuid = None + else: + print('Splash manager connected') + session_uuid = crawlers.get_splash_manager_session_uuid() + is_manager_connected = crawlers.reload_splash_and_proxies_list() + print(is_manager_connected) + if is_manager_connected: + crawlers.relaunch_crawlers() + last_check = int(time.time()) + + while True: + + # # TODO: avoid multiple ping + + # check if manager is connected + if int(time.time()) - last_check > 60: + is_manager_connected = crawlers.is_splash_manager_connected() + current_session_uuid = crawlers.get_splash_manager_session_uuid() + # reload proxy and splash list + if current_session_uuid and current_session_uuid != session_uuid: + is_manager_connected = crawlers.reload_splash_and_proxies_list() + if is_manager_connected: + print('reload proxies and splash list') + crawlers.relaunch_crawlers() + session_uuid = current_session_uuid + if not is_manager_connected: + print('Error, Can\'t connect to Splash manager') + last_check = int(time.time()) + + # # TODO: lauch crawlers if was never connected + # refresh splash and proxy list + elif False: + crawlers.reload_splash_and_proxies_list() + print('list of splash and proxies refreshed') + else: + time.sleep(5) + + # kill/launch new crawler / crawler manager check if already launched + + + # # TODO: handle mutltiple splash_manager + # catch reload request diff --git a/bin/core/screen.py b/bin/core/screen.py index bc6ebdb2..8b65daa4 100755 --- a/bin/core/screen.py +++ b/bin/core/screen.py @@ -4,6 +4,7 @@ import os import subprocess import sys +import re all_screen_name = set() @@ -16,8 +17,11 @@ def is_screen_install(): print(p.stderr) return False -def exist_screen(screen_name): - cmd_1 = ['screen', '-ls'] +def exist_screen(screen_name, with_sudoer=False): + if with_sudoer: + cmd_1 = ['sudo', 'screen', '-ls'] + else: + cmd_1 = ['screen', '-ls'] cmd_2 = ['egrep', 
'[0-9]+.{}'.format(screen_name)] p1 = subprocess.Popen(cmd_1, stdout=subprocess.PIPE) p2 = subprocess.Popen(cmd_2, stdin=p1.stdout, stdout=subprocess.PIPE) @@ -27,6 +31,36 @@ def exist_screen(screen_name): return True return False +def get_screen_pid(screen_name, with_sudoer=False): + if with_sudoer: + cmd_1 = ['sudo', 'screen', '-ls'] + else: + cmd_1 = ['screen', '-ls'] + cmd_2 = ['egrep', '[0-9]+.{}'.format(screen_name)] + p1 = subprocess.Popen(cmd_1, stdout=subprocess.PIPE) + p2 = subprocess.Popen(cmd_2, stdin=p1.stdout, stdout=subprocess.PIPE) + p1.stdout.close() # Allow p1 to receive a SIGPIPE if p2 exits. + output = p2.communicate()[0] + if output: + # extract pids with screen name + regex_pid_screen_name = b'[0-9]+.' + screen_name.encode() + pids = re.findall(regex_pid_screen_name, output) + # extract pids + all_pids = [] + for pid_name in pids: + pid = pid_name.split(b'.')[0].decode() + all_pids.append(pid) + return all_pids + return [] + +def detach_screen(screen_name): + cmd = ['screen', '-d', screen_name] + p = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE) + #if p.stdout: + # print(p.stdout) + if p.stderr: + print(p.stderr) + def create_screen(screen_name): if not exist_screen(screen_name): cmd = ['screen', '-dmS', screen_name] @@ -38,18 +72,59 @@ def create_screen(screen_name): print(p.stderr) return False +def kill_screen(screen_name, with_sudoer=False): + if get_screen_pid(screen_name, with_sudoer=with_sudoer): + for pid in get_screen_pid(screen_name, with_sudoer=with_sudoer): + cmd = ['kill', pid] + p = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE) + if p.stderr: + print(p.stderr) + else: + print('{} killed'.format(pid)) + return True + return False + # # TODO: add check if len(window_name) == 20 # use: screen -S 'pid.screen_name' -p %window_id% -Q title # if len(windows_name) > 20 (truncated by default) -def get_screen_windows_list(screen_name): +def get_screen_windows_list(screen_name, r_set=True): + # detach screen to avoid incomplete result + detach_screen(screen_name) + if r_set: + all_windows_name = set() + else: + all_windows_name = [] cmd = ['screen', '-S', screen_name, '-Q', 'windows'] p = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE) if p.stdout: for window_row in p.stdout.split(b' '): window_id, window_name = window_row.decode().split() - print(window_id) - print(window_name) - print('---') + #print(window_id) + #print(window_name) + #print('---') + if r_set: + all_windows_name.add(window_name) + else: + all_windows_name.append(window_name) + if p.stderr: + print(p.stderr) + return all_windows_name + +def get_screen_windows_id(screen_name): + # detach screen to avoid incomplete result + detach_screen(screen_name) + all_windows_id = {} + cmd = ['screen', '-S', screen_name, '-Q', 'windows'] + p = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE) + if p.stdout: + for window_row in p.stdout.split(b' '): + window_id, window_name = window_row.decode().split() + if window_name not in all_windows_id: + all_windows_id[window_name] = [] + all_windows_id[window_name].append(window_id) + if p.stderr: + print(p.stderr) + return all_windows_id # script_location ${AIL_BIN} def launch_windows_script(screen_name, window_name, dir_project, script_location, script_name, script_options=''): @@ -60,6 +135,16 @@ def launch_windows_script(screen_name, window_name, dir_project, script_location print(p.stdout) print(p.stderr) +def launch_uniq_windows_script(screen_name, window_name, dir_project, 
script_location, script_name, script_options='', kill_previous_windows=False): + all_screen_name = get_screen_windows_id(screen_name) + if window_name in all_screen_name: + if kill_previous_windows: + kill_screen_window(screen_name, all_screen_name[window_name][0], force=True) + else: + print('Error: screen {} already contain a windows with this name {}'.format(screen_name, window_name)) + return None + launch_windows_script(screen_name, window_name, dir_project, script_location, script_name, script_options=script_options) + def kill_screen_window(screen_name, window_id, force=False): if force:# kill cmd = ['screen', '-S', screen_name, '-p', window_id, '-X', 'kill'] diff --git a/bin/lib/ConfigLoader.py b/bin/lib/ConfigLoader.py index 262a44bd..b68aa3a3 100755 --- a/bin/lib/ConfigLoader.py +++ b/bin/lib/ConfigLoader.py @@ -64,3 +64,12 @@ class ConfigLoader(object): def has_section(self, section): return self.cfg.has_section(section) + + def get_all_keys_values_from_section(self, section): + if section in self.cfg: + all_keys_values = [] + for key_name in self.cfg[section]: + all_keys_values.append((key_name, self.cfg.get(section, key_name))) + return all_keys_values + else: + return [] diff --git a/bin/lib/Config_DB.py b/bin/lib/Config_DB.py new file mode 100755 index 00000000..67e106ab --- /dev/null +++ b/bin/lib/Config_DB.py @@ -0,0 +1,155 @@ +#!/usr/bin/python3 + +""" +Config save in DB +=================== + + +""" + +import os +import sys +import redis + +sys.path.append(os.path.join(os.environ['AIL_BIN'], 'lib')) +import ConfigLoader + +config_loader = ConfigLoader.ConfigLoader() +r_serv_db = config_loader.get_redis_conn("ARDB_DB") +config_loader = None + +#### TO PUT IN CONFIG +# later => module timeout +# +## data retention +######################### + +default_config = { + "crawler": { + "enable_har_by_default": False, + "enable_screenshot_by_default": True, + "default_depth_limit": 1, + "default_closespider_pagecount": 50, + "default_user_agent": "Mozilla/5.0 (Windows NT 10.0; rv:68.0) Gecko/20100101 Firefox/68.0", + "default_timeout": 30 + } +} + +def get_default_config(): + return default_config + +def get_default_config_value(section, field): + return default_config[section][field] + +config_type = { + # crawler config + "crawler": { + "enable_har_by_default": bool, + "enable_screenshot_by_default": bool, + "default_depth_limit": int, + "default_closespider_pagecount": int, + "default_user_agent": str, + "default_timeout": int + } +} + +def get_config_type(section, field): + return config_type[section][field] + +# # TODO: add set, dict, list and select_(multiple_)value +def is_valid_type(obj, section, field, value_type=None): + res = isinstance(obj, get_config_type(section, field)) + return res + +def reset_default_config(): + pass + +def set_default_config(section, field): + save_config(section, field, get_default_config_value(section, field)) + +def get_all_config_sections(): + return list(get_default_config()) + +def get_all_config_fields_by_section(section): + return list(get_default_config()[section]) + +def get_config(section, field): + # config field don't exist + if not r_serv_db.hexists(f'config:global:{section}', field): + set_default_config(section, field) + return get_default_config_value(section, field) + + # load default config section + if not r_serv_db.exists('config:global:{}'.format(section)): + save_config(section, field, get_default_config_value(section, field)) + return get_default_config_value(section, field) + + return 
r_serv_db.hget(f'config:global:{section}', field) + +def get_config_dict_by_section(section): + config_dict = {} + for field in get_all_config_fields_by_section(section): + config_dict[field] = get_config(section, field) + return config_dict + +def save_config(section, field, value, value_type=None): ########################################### + if section in default_config: + if is_valid_type(value, section, field, value_type=value_type): + if value_type in ['list', 'set', 'dict']: + pass + else: + r_serv_db.hset(f'config:global:{section}', field, value) + # used by check_integrity + r_serv_db.sadd('config:all_global_section', field, value) + +# check config value + type +def check_integrity(): + pass + + +config_documentation = { + "crawler": { + "enable_har_by_default": 'Enable HAR by default', + "enable_screenshot_by_default": 'Enable screenshot by default', + "default_depth_limit": 'Maximum number of url depth', + "default_closespider_pagecount": 'Maximum number of pages', + "default_user_agent": "User agent used by default", + "default_timeout": "Crawler connection timeout" + } +} + +def get_config_documentation(section, field): + return config_documentation[section][field] + +# def conf_view(): +# class F(MyBaseForm): +# pass +# +# F.username = TextField('username') +# for name in iterate_some_model_dynamically(): +# setattr(F, name, TextField(name.title())) +# +# form = F(request.POST, ...) + +def get_field_full_config(section, field): + dict_config = {} + dict_config['value'] = get_config(section, field) + dict_config['type'] = get_config_type(section, field) + dict_config['info'] = get_config_documentation(section, field) + return dict_config + +def get_full_config_by_section(section): + dict_config = {} + for field in get_all_config_fields_by_section(section): + dict_config[field] = get_field_full_config(section, field) + return dict_config + +def get_full_config(): + dict_config = {} + for section in get_all_config_sections(): + dict_config[section] = get_full_config_by_section(section) + return dict_config + +if __name__ == '__main__': + res = get_full_config() + print(res) diff --git a/bin/lib/crawlers.py b/bin/lib/crawlers.py index ed60fb62..64aa0e7a 100755 --- a/bin/lib/crawlers.py +++ b/bin/lib/crawlers.py @@ -13,6 +13,7 @@ import os import re import redis import sys +import time import uuid from datetime import datetime, timedelta @@ -34,19 +35,24 @@ config_loader = ConfigLoader.ConfigLoader() r_serv_metadata = config_loader.get_redis_conn("ARDB_Metadata") r_serv_onion = config_loader.get_redis_conn("ARDB_Onion") r_cache = config_loader.get_redis_conn("Redis_Cache") -config_loader = None - -# load crawler config -config_loader = ConfigLoader.ConfigLoader(config_file='crawlers.cfg') -#splash_manager_url = config_loader.get_config_str('Splash_Manager', 'splash_url') -#splash_api_key = config_loader.get_config_str('Splash_Manager', 'api_key') +PASTES_FOLDER = os.path.join(os.environ['AIL_HOME'], config_loader.get_config_str("Directories", "pastes")) config_loader = None faup = Faup() +# # # # # # # # +# # +# COMMON # +# # +# # # # # # # # + def generate_uuid(): return str(uuid.uuid4()).replace('-', '') +# # TODO: remove me ? 
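A quick illustration of how the new ``Config_DB`` helper above behaves (a hypothetical usage sketch, not part of the patch): ``get_config()`` lazily seeds the ``config:global:<section>`` hash with the hard-coded defaults on first access, and ``get_full_config_by_section()`` bundles value, expected type and documentation for the settings page.

```python
# Sketch only: exercises bin/lib/Config_DB.py from an AIL environment.
import os
import sys
sys.path.append(os.path.join(os.environ['AIL_BIN'], 'lib'))
import Config_DB

# First access: the field is absent from ARDB, so the default (1) is written
# into the 'config:global:crawler' hash and then returned.
depth_limit = Config_DB.get_config('crawler', 'default_depth_limit')

# Aggregated view used by /crawler/settings: value + type + description per field.
crawler_conf = Config_DB.get_full_config_by_section('crawler')
print(depth_limit, crawler_conf['default_depth_limit']['info'])
```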
+def get_current_date(): + return datetime.now().strftime("%Y%m%d") + def is_valid_onion_domain(domain): if not domain.endswith('.onion'): return False @@ -61,6 +67,10 @@ def is_valid_onion_domain(domain): return True return False +# TEMP FIX +def get_faup(): + return faup + ################################################################################ # # TODO: handle prefix cookies @@ -389,8 +399,127 @@ def api_create_cookie(user_id, cookiejar_uuid, cookie_dict): #### #### +# # # # # # # # +# # +# CRAWLER # +# # +# # # # # # # # + +#### CRAWLER GLOBAL #### + +def get_all_spash_crawler_status(): + crawler_metadata = [] + all_crawlers = r_cache.smembers('all_splash_crawlers') + for crawler in all_crawlers: + crawler_metadata.append(get_splash_crawler_status(crawler)) + return crawler_metadata + +def reset_all_spash_crawler_status(): + r_cache.delete('all_splash_crawlers') + +def get_splash_crawler_status(spash_url): + crawler_type = r_cache.hget('metadata_crawler:{}'.format(spash_url), 'type') + crawling_domain = r_cache.hget('metadata_crawler:{}'.format(spash_url), 'crawling_domain') + started_time = r_cache.hget('metadata_crawler:{}'.format(spash_url), 'started_time') + status_info = r_cache.hget('metadata_crawler:{}'.format(spash_url), 'status') + crawler_info = '{} - {}'.format(spash_url, started_time) + if status_info=='Waiting' or status_info=='Crawling': + status=True + else: + status=False + return {'crawler_info': crawler_info, 'crawling_domain': crawling_domain, 'status_info': status_info, 'status': status, 'type': crawler_type} + +def get_stats_last_crawled_domains(crawler_types, date): + statDomains = {} + for crawler_type in crawler_types: + stat_type = {} + stat_type['domains_up'] = r_serv_onion.scard('{}_up:{}'.format(crawler_type, date)) + stat_type['domains_down'] = r_serv_onion.scard('{}_down:{}'.format(crawler_type, date)) + stat_type['total'] = stat_type['domains_up'] + stat_type['domains_down'] + stat_type['domains_queue'] = get_nb_elem_to_crawl_by_type(crawler_type) + statDomains[crawler_type] = stat_type + return statDomains + +# # TODO: handle custom proxy +def get_splash_crawler_latest_stats(): + now = datetime.now() + date = now.strftime("%Y%m%d") + return get_stats_last_crawled_domains(['onion', 'regular'], date) + +def get_nb_crawlers_to_launch_by_splash_name(splash_name): + res = r_serv_onion.hget('all_crawlers_to_launch', splash_name) + if res: + return int(res) + else: + return 0 + +def get_all_crawlers_to_launch_splash_name(): + return r_serv_onion.hkeys('all_crawlers_to_launch') + +def get_nb_crawlers_to_launch(): + nb_crawlers_to_launch = r_serv_onion.hgetall('all_crawlers_to_launch') + for splash_name in nb_crawlers_to_launch: + nb_crawlers_to_launch[splash_name] = int(nb_crawlers_to_launch[splash_name]) + return nb_crawlers_to_launch + +def get_nb_crawlers_to_launch_ui(): + nb_crawlers_to_launch = get_nb_crawlers_to_launch() + for splash_name in get_all_splash(): + if splash_name not in nb_crawlers_to_launch: + nb_crawlers_to_launch[splash_name] = 0 + return nb_crawlers_to_launch + +def set_nb_crawlers_to_launch(dict_splash_name): + r_serv_onion.delete('all_crawlers_to_launch') + for splash_name in dict_splash_name: + r_serv_onion.hset('all_crawlers_to_launch', splash_name, int(dict_splash_name[splash_name])) + relaunch_crawlers() + +def relaunch_crawlers(): + all_crawlers_to_launch = get_nb_crawlers_to_launch() + for splash_name in all_crawlers_to_launch: + nb_crawlers = int(all_crawlers_to_launch[splash_name]) + + all_crawler_urls = 
get_splash_all_url(splash_name, r_list=True) + if nb_crawlers > len(all_crawler_urls): + print('Error, can\'t launch all Splash Dockers') + print('Please launch {} additional {} Dockers'.format( nb_crawlers - len(all_crawler_urls), splash_name)) + nb_crawlers = len(all_crawler_urls) + + reset_all_spash_crawler_status() + + for i in range(0, int(nb_crawlers)): + splash_url = all_crawler_urls[i] + print(all_crawler_urls[i]) + + launch_ail_splash_crawler(splash_url, script_options='{}'.format(splash_url)) + +def api_set_nb_crawlers_to_launch(dict_splash_name): + # TODO: check if is dict + dict_crawlers_to_launch = {} + all_splash = get_all_splash() + crawlers_to_launch = list(all_splash & set(dict_splash_name.keys())) + for splash_name in crawlers_to_launch: + try: + nb_to_launch = int(dict_splash_name.get(splash_name, 0)) + if nb_to_launch < 0: + return ({'error':'The number of crawlers to launch is negative'}, 400) + except: + return ({'error':'invalid number of crawlers to launch'}, 400) + if nb_to_launch > 0: + dict_crawlers_to_launch[splash_name] = nb_to_launch + + if dict_crawlers_to_launch: + set_nb_crawlers_to_launch(dict_crawlers_to_launch) + return (dict_crawlers_to_launch, 200) + else: + return ({'error':'invalid input'}, 400) + + +##-- CRAWLER GLOBAL --## + #### CRAWLER TASK #### -def create_crawler_task(url, screenshot=True, har=True, depth_limit=1, max_pages=100, auto_crawler=False, crawler_delta=3600, cookiejar_uuid=None, user_agent=None): +def create_crawler_task(url, screenshot=True, har=True, depth_limit=1, max_pages=100, auto_crawler=False, crawler_delta=3600, crawler_type=None, cookiejar_uuid=None, user_agent=None): crawler_config = {} crawler_config['depth_limit'] = depth_limit @@ -430,10 +559,18 @@ def create_crawler_task(url, screenshot=True, har=True, depth_limit=1, max_pages tld = unpack_url['tld'].decode() except: tld = unpack_url['tld'] - if tld == 'onion': - crawler_type = 'onion' + + if crawler_type=='None': + crawler_type = None + + if crawler_type: + if crawler_type=='tor': + crawler_type = 'onion' else: - crawler_type = 'regular' + if tld == 'onion': + crawler_type = 'onion' + else: + crawler_type = 'regular' save_crawler_config(crawler_mode, crawler_type, crawler_config, domain, url=url) send_url_to_crawl_in_queue(crawler_mode, crawler_type, url) @@ -445,6 +582,7 @@ def save_crawler_config(crawler_mode, crawler_type, crawler_config, domain, url= r_serv_onion.set('crawler_config:{}:{}:{}:{}'.format(crawler_mode, crawler_type, domain, url), json.dumps(crawler_config)) def send_url_to_crawl_in_queue(crawler_mode, crawler_type, url): + print('{}_crawler_priority_queue'.format(crawler_type), '{};{}'.format(url, crawler_mode)) r_serv_onion.sadd('{}_crawler_priority_queue'.format(crawler_type), '{};{}'.format(url, crawler_mode)) # add auto crawled url for user UI if crawler_mode == 'auto': @@ -452,7 +590,7 @@ def send_url_to_crawl_in_queue(crawler_mode, crawler_type, url): #### #### #### CRAWLER TASK API #### -def api_create_crawler_task(user_id, url, screenshot=True, har=True, depth_limit=1, max_pages=100, auto_crawler=False, crawler_delta=3600, cookiejar_uuid=None, user_agent=None): +def api_create_crawler_task(user_id, url, screenshot=True, har=True, depth_limit=1, max_pages=100, auto_crawler=False, crawler_delta=3600, crawler_type=None, cookiejar_uuid=None, user_agent=None): # validate url if url is None or url=='' or url=='\n': return ({'error':'invalid depth limit'}, 400) @@ -489,7 +627,10 @@ def api_create_crawler_task(user_id, url, screenshot=True, har=True, 
depth_limit if cookie_owner != user_id: return ({'error': 'The access to this cookiejar is restricted'}, 403) + # # TODO: verify splash name/crawler type + create_crawler_task(url, screenshot=screenshot, har=har, depth_limit=depth_limit, max_pages=max_pages, + crawler_type=crawler_type, auto_crawler=auto_crawler, crawler_delta=crawler_delta, cookiejar_uuid=cookiejar_uuid, user_agent=user_agent) return None @@ -572,6 +713,7 @@ def save_har(har_dir, item_id, har_content): with open(filename, 'w') as f: f.write(json.dumps(har_content)) +# # TODO: FIXME def api_add_crawled_item(dict_crawled): domain = None @@ -580,30 +722,200 @@ def api_add_crawled_item(dict_crawled): save_crawled_item(item_id, response.data['html']) create_item_metadata(item_id, domain, 'last_url', port, 'father') +#### CRAWLER QUEUES #### +def get_all_crawlers_queues_types(): + all_queues_types = set() + all_splash_name = get_all_crawlers_to_launch_splash_name() + for splash_name in all_splash_name: + all_queues_types.add(get_splash_crawler_type(splash_name)) + all_splash_name = list() + return all_queues_types -#### SPLASH MANAGER #### -def get_splash_manager_url(reload=False): # TODO: add config reload - return splash_manager_url +def get_crawler_queue_types_by_splash_name(splash_name): + all_domain_type = [splash_name] + crawler_type = get_splash_crawler_type(splash_name) + #if not is_splash_used_in_discovery(splash_name) + if crawler_type == 'tor': + all_domain_type.append('onion') + all_domain_type.append('regular') + else: + all_domain_type.append('regular') + return all_domain_type -def get_splash_api_key(reload=False): # TODO: add config reload - return splash_api_key +def get_crawler_type_by_url(url): + faup.decode(url) + unpack_url = faup.get() + ## TODO: # FIXME: remove me + try: + tld = unpack_url['tld'].decode() + except: + tld = unpack_url['tld'] + + if tld == 'onion': + crawler_type = 'onion' + else: + crawler_type = 'regular' + return crawler_type + + +def get_elem_to_crawl_by_queue_type(l_queue_type): + ## queues priority: + # 1 - priority queue + # 2 - discovery queue + # 3 - normal queue + ## + all_queue_key = ['{}_crawler_priority_queue', '{}_crawler_discovery_queue', '{}_crawler_queue'] + + for queue_key in all_queue_key: + for queue_type in l_queue_type: + message = r_serv_onion.spop(queue_key.format(queue_type)) + if message: + dict_to_crawl = {} + splitted = message.rsplit(';', 1) + if len(splitted) == 2: + url, item_id = splitted + item_id = item_id.replace(PASTES_FOLDER+'/', '') + else: + # # TODO: to check/refractor + item_id = None + url = message + crawler_type = get_crawler_type_by_url(url) + return {'url': url, 'paste': item_id, 'type_service': crawler_type, 'queue_type': queue_type, 'original_message': message} + return None + +def get_nb_elem_to_crawl_by_type(queue_type): + nb = r_serv_onion.scard('{}_crawler_priority_queue'.format(queue_type)) + nb += r_serv_onion.scard('{}_crawler_discovery_queue'.format(queue_type)) + nb += r_serv_onion.scard('{}_crawler_queue'.format(queue_type)) + return nb + +#### ---- #### + +# # # # # # # # # # # # +# # +# SPLASH MANAGER # +# # +# # # # # # # # # # # # +def get_splash_manager_url(reload=False): # TODO: add in db config + return r_serv_onion.get('crawler:splash:manager:url') + +def get_splash_api_key(reload=False): # TODO: add in db config + return r_serv_onion.get('crawler:splash:manager:key') + +def get_hidden_splash_api_key(): # TODO: add in db config + key = get_splash_api_key() + if key: + if len(key)==41: + return 
f'{key[:4]}*********************************{key[-4:]}' + +def is_valid_api_key(api_key, search=re.compile(r'[^a-zA-Z0-9_-]').search): + if len(api_key) != 41: + return False + return not bool(search(api_key)) + +def save_splash_manager_url_api(url, api_key): + r_serv_onion.set('crawler:splash:manager:url', url) + r_serv_onion.set('crawler:splash:manager:key', api_key) def get_splash_url_from_manager_url(splash_manager_url, splash_port): url = urlparse(splash_manager_url) host = url.netloc.split(':', 1)[0] - return 'http://{}:{}'.format(host, splash_port) + return '{}:{}'.format(host, splash_port) + +# def is_splash_used_in_discovery(splash_name): +# res = r_serv_onion.hget('splash:metadata:{}'.format(splash_name), 'discovery_queue') +# if res == 'True': +# return True +# else: +# return False + +def restart_splash_docker(splash_url, splash_name): + splash_port = splash_url.split(':')[-1] + return _restart_splash_docker(splash_port, splash_name) + +def is_splash_manager_connected(delta_check=30): + last_check = r_cache.hget('crawler:splash:manager', 'last_check') + if last_check: + if int(time.time()) - int(last_check) > delta_check: + ping_splash_manager() + else: + ping_splash_manager() + res = r_cache.hget('crawler:splash:manager', 'connected') + return res == 'True' + +def update_splash_manager_connection_status(is_connected, req_error=None): + r_cache.hset('crawler:splash:manager', 'connected', is_connected) + r_cache.hset('crawler:splash:manager', 'last_check', int(time.time())) + if not req_error: + r_cache.hdel('crawler:splash:manager', 'error') + else: + r_cache.hset('crawler:splash:manager', 'status_code', req_error['status_code']) + r_cache.hset('crawler:splash:manager', 'error', req_error['error']) + +def get_splash_manager_connection_metadata(force_ping=False): + dict_manager={} + if force_ping: + dict_manager['status'] = ping_splash_manager() + else: + dict_manager['status'] = is_splash_manager_connected() + if not dict_manager['status']: + dict_manager['status_code'] = r_cache.hget('crawler:splash:manager', 'status_code') + dict_manager['error'] = r_cache.hget('crawler:splash:manager', 'error') + return dict_manager ## API ## def ping_splash_manager(): - req = requests.get('{}/api/v1/ping'.format(get_splash_manager_url()), headers={"Authorization": get_splash_api_key()}, verify=False) - if req.status_code == 200: - return True - else: - print(req.json()) + splash_manager_url = get_splash_manager_url() + if not splash_manager_url: return False + try: + req = requests.get('{}/api/v1/ping'.format(splash_manager_url), headers={"Authorization": get_splash_api_key()}, verify=False) + if req.status_code == 200: + update_splash_manager_connection_status(True) + return True + else: + res = req.json() + if 'reason' in res: + req_error = {'status_code': req.status_code, 'error': res['reason']} + else: + print(req.json()) + req_error = {'status_code': req.status_code, 'error': json.dumps(req.json())} + update_splash_manager_connection_status(False, req_error=req_error) + return False + except requests.exceptions.ConnectionError: + pass + # splash manager unreachable + req_error = {'status_code': 500, 'error': 'splash manager unreachable'} + update_splash_manager_connection_status(False, req_error=req_error) + return False + +def get_splash_manager_session_uuid(): + try: + req = requests.get('{}/api/v1/get/session_uuid'.format(get_splash_manager_url()), headers={"Authorization": get_splash_api_key()}, verify=False) + if req.status_code == 200: + res = req.json() + if res: + return 
res['session_uuid'] + else: + print(req.json()) + except (requests.exceptions.ConnectionError, requests.exceptions.MissingSchema): + # splash manager unreachable + update_splash_manager_connection_status(False) + +def get_splash_manager_version(): + splash_manager_url = get_splash_manager_url() + if splash_manager_url: + try: + req = requests.get('{}/api/v1/version'.format(splash_manager_url), headers={"Authorization": get_splash_api_key()}, verify=False) + if req.status_code == 200: + return req.json()['message'] + else: + print(req.json()) + except requests.exceptions.ConnectionError: + pass def get_all_splash_manager_containers_name(): - req = requests.get('{}/api/v1/get/splash/name/all'.format(get_splash_manager_url()), headers={"Authorization": get_splash_api_key()}, verify=False) + req = requests.get('{}/api/v1/get/splash/all'.format(get_splash_manager_url()), headers={"Authorization": get_splash_api_key()}, verify=False) if req.status_code == 200: return req.json() else: @@ -615,6 +927,35 @@ def get_all_splash_manager_proxies(): return req.json() else: print(req.json()) + +def _restart_splash_docker(splash_port, splash_name): + dict_to_send = {'port': splash_port, 'name': splash_name} + req = requests.post('{}/api/v1/splash/restart'.format(get_splash_manager_url()), headers={"Authorization": get_splash_api_key()}, verify=False, json=dict_to_send) + if req.status_code == 200: + return req.json() + else: + print(req.json()) + +def api_save_splash_manager_url_api(data): + # unpack json + manager_url = data.get('url', None) + api_key = data.get('api_key', None) + if not manager_url or not api_key: + return ({'status': 'error', 'reason': 'No url or API key supplied'}, 400) + # check if is valid url + try: + result = urlparse(manager_url) + if not all([result.scheme, result.netloc]): + return ({'status': 'error', 'reason': 'Invalid url'}, 400) + except: + return ({'status': 'error', 'reason': 'Invalid url'}, 400) + + # check if is valid key + if not is_valid_api_key(api_key): + return ({'status': 'error', 'reason': 'Invalid API key'}, 400) + + save_splash_manager_url_api(manager_url, api_key) + return ({'url': manager_url, 'api_key': get_hidden_splash_api_key()}, 200) ## -- ## ## SPLASH ## @@ -647,7 +988,23 @@ def get_splash_name_by_url(splash_url): def get_splash_crawler_type(splash_name): return r_serv_onion.hget('splash:metadata:{}'.format(splash_name), 'crawler_type') -def get_all_splash_by_proxy(proxy_name): +def get_splash_crawler_description(splash_name): + return r_serv_onion.hget('splash:metadata:{}'.format(splash_name), 'description') + +def get_splash_crawler_metadata(splash_name): + dict_splash = {} + dict_splash['proxy'] = get_splash_proxy(splash_name) + dict_splash['type'] = get_splash_crawler_type(splash_name) + dict_splash['description'] = get_splash_crawler_description(splash_name) + return dict_splash + +def get_all_splash_crawler_metadata(): + dict_splash = {} + for splash_name in get_all_splash(): + dict_splash[splash_name] = get_splash_crawler_metadata(splash_name) + return dict_splash + +def get_all_splash_by_proxy(proxy_name, r_list=False): res = r_serv_onion.smembers('proxy:splash:{}'.format(proxy_name)) if res: if r_list: @@ -683,16 +1040,50 @@ def delete_all_proxies(): for proxy_name in get_all_proxies(): delete_proxy(proxy_name) +def get_proxy_host(proxy_name): + return r_serv_onion.hget('proxy:metadata:{}'.format(proxy_name), 'host') + +def get_proxy_port(proxy_name): + return r_serv_onion.hget('proxy:metadata:{}'.format(proxy_name), 'port') + +def 
get_proxy_type(proxy_name): + return r_serv_onion.hget('proxy:metadata:{}'.format(proxy_name), 'type') + +def get_proxy_crawler_type(proxy_name): + return r_serv_onion.hget('proxy:metadata:{}'.format(proxy_name), 'crawler_type') + +def get_proxy_description(proxy_name): + return r_serv_onion.hget('proxy:metadata:{}'.format(proxy_name), 'description') + +def get_proxy_metadata(proxy_name): + meta_dict = {} + meta_dict['host'] = get_proxy_host(proxy_name) + meta_dict['port'] = get_proxy_port(proxy_name) + meta_dict['type'] = get_proxy_type(proxy_name) + meta_dict['crawler_type'] = get_proxy_crawler_type(proxy_name) + meta_dict['description'] = get_proxy_description(proxy_name) + return meta_dict + +def get_all_proxies_metadata(): + all_proxy_dict = {} + for proxy_name in get_all_proxies(): + all_proxy_dict[proxy_name] = get_proxy_metadata(proxy_name) + return all_proxy_dict + +# def set_proxy_used_in_discovery(proxy_name, value): +# r_serv_onion.hset('splash:metadata:{}'.format(splash_name), 'discovery_queue', value) + def delete_proxy(proxy_name): # # TODO: force delete (delete all proxy) proxy_splash = get_all_splash_by_proxy(proxy_name) - if proxy_splash: - print('error, a splash container is using this proxy') + #if proxy_splash: + # print('error, a splash container is using this proxy') r_serv_onion.delete('proxy:metadata:{}'.format(proxy_name)) r_serv_onion.srem('all_proxy', proxy_name) ## -- ## ## LOADER ## def load_all_splash_containers(): + delete_all_splash_containers() all_splash_containers_name = get_all_splash_manager_containers_name() for splash_name in all_splash_containers_name: r_serv_onion.sadd('all_splash', splash_name) @@ -715,6 +1106,7 @@ def load_all_splash_containers(): r_serv_onion.set('splash:map:url:name:{}'.format(splash_url), splash_name) def load_all_proxy(): + delete_all_proxies() all_proxies = get_all_splash_manager_proxies() for proxy_name in all_proxies: proxy_dict = all_proxies[proxy_name] @@ -725,13 +1117,17 @@ def load_all_proxy(): description = all_proxies[proxy_name].get('description', None) if description: r_serv_onion.hset('proxy:metadata:{}'.format(proxy_name), 'description', description) + r_serv_onion.sadd('all_proxy', proxy_name) -def init_splash_list_db(): - delete_all_splash_containers() - delete_all_proxies() +def reload_splash_and_proxies_list(): if ping_splash_manager(): - load_all_splash_containers() + # LOAD PROXIES containers load_all_proxy() + # LOAD SPLASH containers + load_all_splash_containers() + return True + else: + return False # # TODO: kill crawler screen ? 
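To make the reload flow concrete: ``reload_splash_and_proxies_list()`` pings the Splash manager and, on success, wipes and rebuilds the proxy and splash entries documented in OVERVIEW.md. A sketch of how the result can be inspected (function names from the patch, example output invented):

```python
# Sketch only: relies on the crawlers helpers above and a reachable
# Splash manager configured via the settings page.
import os
import sys
sys.path.append(os.path.join(os.environ['AIL_BIN'], 'lib'))
import crawlers

if crawlers.reload_splash_and_proxies_list():   # pings the manager first
    # proxy:metadata:<proxy name> -> host / port / type / crawler_type / description
    print(crawlers.get_all_proxies_metadata())
    # splash:metadata:<splash name> -> proxy / type / description
    print(crawlers.get_all_splash_crawler_metadata())
```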
## -- ## @@ -742,7 +1138,7 @@ def launch_ail_splash_crawler(splash_url, script_options=''): script_location = os.path.join(os.environ['AIL_BIN']) script_name = 'Crawler.py' screen.create_screen(screen_name) - screen.launch_windows_script(screen_name, splash_url, dir_project, script_location, script_name, script_options=script_options) + screen.launch_uniq_windows_script(screen_name, splash_url, dir_project, script_location, script_name, script_options=script_options, kill_previous_windows=True) ## -- ## @@ -752,3 +1148,8 @@ def launch_ail_splash_crawler(splash_url, script_options=''): #### CRAWLER PROXY #### #### ---- #### + +if __name__ == '__main__': + res = get_splash_manager_version() + #res = restart_splash_docker('127.0.0.1:8050', 'default_splash_tor') + print(res) diff --git a/doc/screenshots/splash_manager_config_edit_1.png b/doc/screenshots/splash_manager_config_edit_1.png new file mode 100644 index 00000000..5de9a2b0 Binary files /dev/null and b/doc/screenshots/splash_manager_config_edit_1.png differ diff --git a/doc/screenshots/splash_manager_config_edit_2.png b/doc/screenshots/splash_manager_config_edit_2.png new file mode 100644 index 00000000..eeea02fa Binary files /dev/null and b/doc/screenshots/splash_manager_config_edit_2.png differ diff --git a/doc/screenshots/splash_manager_nb_crawlers_1.png b/doc/screenshots/splash_manager_nb_crawlers_1.png new file mode 100644 index 00000000..885b5d3f Binary files /dev/null and b/doc/screenshots/splash_manager_nb_crawlers_1.png differ diff --git a/doc/screenshots/splash_manager_nb_crawlers_2.png b/doc/screenshots/splash_manager_nb_crawlers_2.png new file mode 100644 index 00000000..e0bad14f Binary files /dev/null and b/doc/screenshots/splash_manager_nb_crawlers_2.png differ diff --git a/etc/splash/proxy-profiles/default.ini b/etc/splash/proxy-profiles/default.ini deleted file mode 100644 index 91208135..00000000 --- a/etc/splash/proxy-profiles/default.ini +++ /dev/null @@ -1,4 +0,0 @@ -[proxy] -host=localhost -port=9050 -type=SOCKS5 diff --git a/install_virtualenv.sh b/install_virtualenv.sh index f20cb93f..aa97b7cc 100755 --- a/install_virtualenv.sh +++ b/install_virtualenv.sh @@ -16,9 +16,11 @@ if [ -z "$VIRTUAL_ENV" ]; then echo export AIL_REDIS=$(pwd)/redis/src/ >> ./AILENV/bin/activate echo export AIL_ARDB=$(pwd)/ardb/src/ >> ./AILENV/bin/activate - . ./AILENV/bin/activate fi +# activate virtual environment +. 
./AILENV/bin/activate + pip3 install -U pip pip3 install 'git+https://github.com/D4-project/BGP-Ranking.git/@7e698f87366e6f99b4d0d11852737db28e3ddc62#egg=pybgpranking&subdirectory=client' pip3 install -U -r requirements.txt diff --git a/var/www/blueprints/crawler_splash.py b/var/www/blueprints/crawler_splash.py index f80b3967..a5f6d548 100644 --- a/var/www/blueprints/crawler_splash.py +++ b/var/www/blueprints/crawler_splash.py @@ -24,10 +24,12 @@ sys.path.append(os.path.join(os.environ['AIL_BIN'], 'packages')) import Tag sys.path.append(os.path.join(os.environ['AIL_BIN'], 'lib')) -import Domain import crawlers +import Domain import Language +import Config_DB + r_cache = Flask_config.r_cache r_serv_db = Flask_config.r_serv_db r_serv_tags = Flask_config.r_serv_tags @@ -49,13 +51,44 @@ def create_json_response(data, status_code): return Response(json.dumps(data, indent=2, sort_keys=True), mimetype='application/json'), status_code # ============= ROUTES ============== +@crawler_splash.route("/crawlers/dashboard", methods=['GET']) +@login_required +@login_read_only +def crawlers_dashboard(): + # # TODO: get splash manager status + is_manager_connected = crawlers.get_splash_manager_connection_metadata() + all_splash_crawler_status = crawlers.get_all_spash_crawler_status() + splash_crawlers_latest_stats = crawlers.get_splash_crawler_latest_stats() + date = crawlers.get_current_date() + + return render_template("dashboard_splash_crawler.html", all_splash_crawler_status = all_splash_crawler_status, + is_manager_connected=is_manager_connected, date=date, + splash_crawlers_latest_stats=splash_crawlers_latest_stats) + +@crawler_splash.route("/crawlers/crawler_dashboard_json", methods=['GET']) +@login_required +@login_read_only +def crawler_dashboard_json(): + + all_splash_crawler_status = crawlers.get_all_spash_crawler_status() + splash_crawlers_latest_stats = crawlers.get_splash_crawler_latest_stats() + + return jsonify({'all_splash_crawler_status': all_splash_crawler_status, + 'splash_crawlers_latest_stats':splash_crawlers_latest_stats}) + @crawler_splash.route("/crawlers/manual", methods=['GET']) @login_required @login_read_only def manual(): user_id = current_user.get_id() l_cookiejar = crawlers.api_get_cookies_list_select(user_id) - return render_template("crawler_manual.html", crawler_enabled=True, l_cookiejar=l_cookiejar) + all_crawlers_types = crawlers.get_all_crawlers_queues_types() + all_splash_name = crawlers.get_all_crawlers_to_launch_splash_name() + return render_template("crawler_manual.html", + is_manager_connected=crawlers.get_splash_manager_connection_metadata(), + all_crawlers_types=all_crawlers_types, + all_splash_name=all_splash_name, + l_cookiejar=l_cookiejar) @crawler_splash.route("/crawlers/send_to_spider", methods=['POST']) @login_required @@ -65,6 +98,8 @@ def send_to_spider(): # POST val url = request.form.get('url_to_crawl') + crawler_type = request.form.get('crawler_queue_type') + splash_name = request.form.get('splash_name') auto_crawler = request.form.get('crawler_type') crawler_delta = request.form.get('crawler_epoch') screenshot = request.form.get('screenshot') @@ -73,6 +108,9 @@ def send_to_spider(): max_pages = request.form.get('max_pages') cookiejar_uuid = request.form.get('cookiejar') + if splash_name: + crawler_type = splash_name + if cookiejar_uuid: if cookiejar_uuid == 'None': cookiejar_uuid = None @@ -81,6 +119,7 @@ def send_to_spider(): cookiejar_uuid = cookiejar_uuid[-1].replace(' ', '') res = crawlers.api_create_crawler_task(user_id, url, 
screenshot=screenshot, har=har, depth_limit=depth_limit, max_pages=max_pages, + crawler_type=crawler_type, auto_crawler=auto_crawler, crawler_delta=crawler_delta, cookiejar_uuid=cookiejar_uuid) if res: return create_json_response(res[0], res[1]) @@ -459,4 +498,61 @@ def crawler_cookiejar_cookie_json_add_post(): return redirect(url_for('crawler_splash.crawler_cookiejar_cookie_add', cookiejar_uuid=cookiejar_uuid)) +@crawler_splash.route('/crawler/settings', methods=['GET']) +@login_required +@login_analyst +def crawler_splash_setings(): + all_proxies = crawlers.get_all_proxies_metadata() + all_splash = crawlers.get_all_splash_crawler_metadata() + nb_crawlers_to_launch = crawlers.get_nb_crawlers_to_launch() + + splash_manager_url = crawlers.get_splash_manager_url() + api_key = crawlers.get_hidden_splash_api_key() + is_manager_connected = crawlers.get_splash_manager_connection_metadata(force_ping=True) + crawler_full_config = Config_DB.get_full_config_by_section('crawler') + + return render_template("settings_splash_crawler.html", + is_manager_connected=is_manager_connected, + splash_manager_url=splash_manager_url, api_key=api_key, + nb_crawlers_to_launch=nb_crawlers_to_launch, + all_splash=all_splash, all_proxies=all_proxies, + crawler_full_config=crawler_full_config) + +@crawler_splash.route('/crawler/settings/crawler_manager', methods=['GET', 'POST']) +@login_required +@login_admin +def crawler_splash_setings_crawler_manager(): + if request.method == 'POST': + splash_manager_url = request.form.get('splash_manager_url') + api_key = request.form.get('api_key') + + res = crawlers.api_save_splash_manager_url_api({'url':splash_manager_url, 'api_key':api_key}) + if res[1] != 200: + return Response(json.dumps(res[0], indent=2, sort_keys=True), mimetype='application/json'), res[1] + else: + return redirect(url_for('crawler_splash.crawler_splash_setings')) + else: + splash_manager_url = crawlers.get_splash_manager_url() + api_key = crawlers.get_splash_api_key() + return render_template("settings_edit_splash_crawler_manager.html", + splash_manager_url=splash_manager_url, api_key=api_key) + +@crawler_splash.route('/crawler/settings/crawlers_to_lauch', methods=['GET', 'POST']) +@login_required +@login_admin +def crawler_splash_setings_crawlers_to_lauch(): + if request.method == 'POST': + dict_splash_name = {} + for crawler_name in list(request.form): + dict_splash_name[crawler_name]= request.form.get(crawler_name) + res = crawlers.api_set_nb_crawlers_to_launch(dict_splash_name) + if res[1] != 200: + return Response(json.dumps(res[0], indent=2, sort_keys=True), mimetype='application/json'), res[1] + else: + return redirect(url_for('crawler_splash.crawler_splash_setings')) + else: + nb_crawlers_to_launch = crawlers.get_nb_crawlers_to_launch_ui() + return render_template("settings_edit_crawlers_to_launch.html", + nb_crawlers_to_launch=nb_crawlers_to_launch) + ## - - ## diff --git a/var/www/blueprints/root.py b/var/www/blueprints/root.py index 2c69be46..9e9a62da 100644 --- a/var/www/blueprints/root.py +++ b/var/www/blueprints/root.py @@ -74,10 +74,13 @@ def login(): if user.request_password_change(): return redirect(url_for('root.change_password')) else: - if next_page and next_page!='None': + # update note + # next page + if next_page and next_page!='None' and next_page!='/': return redirect(next_page) + # dashboard else: - return redirect(url_for('dashboard.index')) + return redirect(url_for('dashboard.index', update_note=True)) # login failed else: # set brute force protection @@ -113,7 +116,9 @@ def 
change_password(): if check_password_strength(password1): user_id = current_user.get_id() create_user_db(user_id , password1, update=True) - return redirect(url_for('dashboard.index')) + # update Note + # dashboard + return redirect(url_for('dashboard.index', update_note=True)) else: error = 'Incorrect password' return render_template("change_password.html", error=error) diff --git a/var/www/modules/dashboard/Flask_dashboard.py b/var/www/modules/dashboard/Flask_dashboard.py index d57c4e67..7ba7b165 100644 --- a/var/www/modules/dashboard/Flask_dashboard.py +++ b/var/www/modules/dashboard/Flask_dashboard.py @@ -155,6 +155,8 @@ def stuff(): @login_required @login_read_only def index(): + update_note = request.args.get('update_note') + default_minute = config_loader.get_config_str("Flask", "minute_processed_paste") threshold_stucked_module = config_loader.get_config_int("Module_ModuleInformation", "threshold_stucked_module") log_select = {10, 25, 50, 100} @@ -176,6 +178,7 @@ def index(): return render_template("index.html", default_minute = default_minute, threshold_stucked_module=threshold_stucked_module, log_select=log_select, selected=max_dashboard_logs, update_warning_message=update_warning_message, update_in_progress=update_in_progress, + update_note=update_note, update_warning_message_notice_me=update_warning_message_notice_me) # ========= REGISTRATION ========= diff --git a/var/www/modules/dashboard/templates/index.html b/var/www/modules/dashboard/templates/index.html index 5d40df1c..19f86147 100644 --- a/var/www/modules/dashboard/templates/index.html +++ b/var/www/modules/dashboard/templates/index.html @@ -72,12 +72,10 @@ {%endif%} - + + {%if update_note%} + {% include 'dashboard/update_modal.html' %} + {%endif%}
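The three hunks above (root.py, Flask_dashboard.py, index.html) wire the same flag end to end: a successful login or password change redirects to the dashboard with `update_note=True`, the dashboard view reads the flag back from the query string, and the template only includes the update modal when the flag is present. The following is a minimal standalone sketch of that pattern, not AIL's actual code; the route names and `DASHBOARD_TMPL` are illustrative assumptions.

```python
# Sketch of the "redirect with a query-string flag, template decides" pattern.
# Names (/login_ok, DASHBOARD_TMPL) are illustrative, not AIL identifiers.
from flask import Flask, redirect, render_template_string, request, url_for

app = Flask(__name__)

DASHBOARD_TMPL = """
<h1>Dashboard</h1>
{% if update_note %}
  <div class="modal">An update note is available.</div>
{% endif %}
"""

@app.route('/login_ok')
def login_ok():
    # Mirrors the root.py change: honor a real "next" page, otherwise land on
    # the dashboard and ask it to display the update note.
    next_page = request.args.get('next')
    if next_page and next_page not in ('None', '/'):
        return redirect(next_page)  # real code should validate this target
    return redirect(url_for('index', update_note=True))

@app.route('/')
def index():
    # Mirrors the Flask_dashboard.py change: the flag travels as a query parameter
    # and is handed to the template, which conditionally includes the modal.
    update_note = request.args.get('update_note')
    return render_template_string(DASHBOARD_TMPL, update_note=update_note)

if __name__ == '__main__':
    app.run(debug=True)
```

Keeping the flag in the query string keeps the redirect stateless; the real views additionally carry authentication decorators and render full templates rather than an inline string.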
diff --git a/var/www/modules/hiddenServices/Flask_hiddenServices.py b/var/www/modules/hiddenServices/Flask_hiddenServices.py index bab5553a..55a7abe4 100644 --- a/var/www/modules/hiddenServices/Flask_hiddenServices.py +++ b/var/www/modules/hiddenServices/Flask_hiddenServices.py @@ -18,6 +18,7 @@ from flask_login import login_required from Date import Date from HiddenServices import HiddenServices +import crawlers # ============ VARIABLES ============ import Flask_config @@ -27,7 +28,6 @@ baseUrl = Flask_config.baseUrl r_cache = Flask_config.r_cache r_serv_onion = Flask_config.r_serv_onion r_serv_metadata = Flask_config.r_serv_metadata -crawler_enabled = Flask_config.crawler_enabled bootstrap_label = Flask_config.bootstrap_label sys.path.append(os.path.join(os.environ['AIL_BIN'], 'lib')) @@ -231,22 +231,22 @@ def delete_auto_crawler(url): # ============= ROUTES ============== -@hiddenServices.route("/crawlers/", methods=['GET']) -@login_required -@login_read_only -def dashboard(): - crawler_metadata_onion = get_crawler_splash_status('onion') - crawler_metadata_regular = get_crawler_splash_status('regular') - - now = datetime.datetime.now() - date = now.strftime("%Y%m%d") - statDomains_onion = get_stats_last_crawled_domains('onion', date) - statDomains_regular = get_stats_last_crawled_domains('regular', date) - - return render_template("Crawler_dashboard.html", crawler_metadata_onion = crawler_metadata_onion, - crawler_enabled=crawler_enabled, date=date, - crawler_metadata_regular=crawler_metadata_regular, - statDomains_onion=statDomains_onion, statDomains_regular=statDomains_regular) +# @hiddenServices.route("/crawlers/", methods=['GET']) +# @login_required +# @login_read_only +# def dashboard(): +# crawler_metadata_onion = get_crawler_splash_status('onion') +# crawler_metadata_regular = get_crawler_splash_status('regular') +# +# now = datetime.datetime.now() +# date = now.strftime("%Y%m%d") +# statDomains_onion = get_stats_last_crawled_domains('onion', date) +# statDomains_regular = get_stats_last_crawled_domains('regular', date) +# +# return render_template("Crawler_dashboard.html", crawler_metadata_onion = crawler_metadata_onion, +# date=date, +# crawler_metadata_regular=crawler_metadata_regular, +# statDomains_onion=statDomains_onion, statDomains_regular=statDomains_regular) @hiddenServices.route("/crawlers/crawler_splash_onion", methods=['GET']) @login_required @@ -288,7 +288,7 @@ def Crawler_Splash_last_by_type(): crawler_metadata = get_crawler_splash_status(type) return render_template("Crawler_Splash_last_by_type.html", type=type, type_name=type_name, - crawler_enabled=crawler_enabled, + is_manager_connected=crawlers.get_splash_manager_connection_metadata(), last_domains=list_domains, statDomains=statDomains, crawler_metadata=crawler_metadata, date_from=date_string, date_to=date_string) @@ -424,7 +424,7 @@ def auto_crawler(): return render_template("Crawler_auto.html", page=page, nb_page_max=nb_page_max, last_domains=last_domains, - crawler_enabled=crawler_enabled, + is_manager_connected=crawlers.get_splash_manager_connection_metadata(), auto_crawler_domain_onions_metadata=auto_crawler_domain_onions_metadata, auto_crawler_domain_regular_metadata=auto_crawler_domain_regular_metadata) @@ -439,23 +439,6 @@ def remove_auto_crawler(): delete_auto_crawler(url) return redirect(url_for('hiddenServices.auto_crawler', page=page)) -@hiddenServices.route("/crawlers/crawler_dashboard_json", methods=['GET']) -@login_required -@login_read_only -def crawler_dashboard_json(): - - 
crawler_metadata_onion = get_crawler_splash_status('onion') - crawler_metadata_regular = get_crawler_splash_status('regular') - - now = datetime.datetime.now() - date = now.strftime("%Y%m%d") - - statDomains_onion = get_stats_last_crawled_domains('onion', date) - statDomains_regular = get_stats_last_crawled_domains('regular', date) - - return jsonify({'statDomains_onion': statDomains_onion, 'statDomains_regular': statDomains_regular, - 'crawler_metadata_onion':crawler_metadata_onion, 'crawler_metadata_regular':crawler_metadata_regular}) - # # TODO: refractor @hiddenServices.route("/hiddenServices/last_crawled_domains_with_stats_json", methods=['GET']) @login_required diff --git a/var/www/modules/hiddenServices/templates/Crawler_Splash_last_by_type.html b/var/www/modules/hiddenServices/templates/Crawler_Splash_last_by_type.html index f793b3a9..248600e4 100644 --- a/var/www/modules/hiddenServices/templates/Crawler_Splash_last_by_type.html +++ b/var/www/modules/hiddenServices/templates/Crawler_Splash_last_by_type.html @@ -92,30 +92,6 @@
-
-
- Crawlers Status -
-
- - - {% for crawler in crawler_metadata %} - - - - - - {% endfor %} - -
- {{crawler['crawler_info']}} - - {{crawler['crawling_domain']}} - - {{crawler['status_info']}} -
-
-
@@ -189,79 +165,6 @@ function toggle_sidebar(){ } - - - - diff --git a/var/www/modules/hiddenServices/templates/Crawler_dashboard.html b/var/www/templates/crawler/crawler_splash/dashboard_splash_crawler.html similarity index 53% rename from var/www/modules/hiddenServices/templates/Crawler_dashboard.html rename to var/www/templates/crawler/crawler_splash/dashboard_splash_crawler.html index 86c82476..0a80d08c 100644 --- a/var/www/modules/hiddenServices/templates/Crawler_dashboard.html +++ b/var/www/templates/crawler/crawler_splash/dashboard_splash_crawler.html @@ -36,34 +36,15 @@
Onions Crawlers
-
- - - {% for crawler in crawler_metadata_onion %} - - - - - - {% endfor %} - -
- {{crawler['crawler_info']}} - - {{crawler['crawling_domain']}} - - {{crawler['status_info']}} -
-
@@ -73,58 +54,63 @@
Regular Crawlers
-
- - - {% for crawler in crawler_metadata_regular %} - - - - - - {% endfor %} - -
- {{crawler['crawler_info']}} - - {{crawler['crawling_domain']}} - - {{crawler['status_info']}} -
-
- {% include 'domains/block_domains_name_search.html' %} + + + {% for splash_crawler in all_splash_crawler_status %} + + + + + + + {% endfor %} + +
+ {{splash_crawler['crawler_info']}} + + {%if splash_crawler['type']=='onion'%} + + {%else%} + + {%endif%} + + {{splash_crawler['crawling_domain']}} + + {{splash_crawler['status_info']}} +
+ {% include 'domains/block_domains_name_search.html' %} -
-
-
- @@ -176,24 +162,24 @@ function toggle_sidebar(){ function refresh_crawler_status(){ - $.getJSON("{{ url_for('hiddenServices.crawler_dashboard_json') }}", + $.getJSON("{{ url_for('crawler_splash.crawler_dashboard_json') }}", function(data) { - $('#stat_onion_domain_up').text(data.statDomains_onion['domains_up']); - $('#stat_onion_domain_down').text(data.statDomains_onion['domains_down']); - $('#stat_onion_total').text(data.statDomains_onion['total']); - $('#stat_onion_queue').text(data.statDomains_onion['domains_queue']); + $('#stat_onion_domain_up').text(data.splash_crawlers_latest_stats['onion']['domains_up']); + $('#stat_onion_domain_down').text(data.splash_crawlers_latest_stats['onion']['domains_down']); + $('#stat_onion_total').text(data.splash_crawlers_latest_stats['onion']['total']); + $('#stat_onion_queue').text(data.splash_crawlers_latest_stats['onion']['domains_queue']); - $('#stat_regular_domain_up').text(data.statDomains_regular['domains_up']); - $('#stat_regular_domain_down').text(data.statDomains_regular['domains_down']); - $('#stat_regular_total').text(data.statDomains_regular['total']); - $('#stat_regular_queue').text(data.statDomains_regular['domains_queue']); + $('#stat_regular_domain_up').text(data.splash_crawlers_latest_stats['regular']['domains_up']); + $('#stat_regular_domain_down').text(data.splash_crawlers_latest_stats['regular']['domains_down']); + $('#stat_regular_total').text(data.splash_crawlers_latest_stats['regular']['total']); + $('#stat_regular_queue').text(data.splash_crawlers_latest_stats['regular']['domains_queue']); - if(data.crawler_metadata_onion.length!=0){ + if(data.all_splash_crawler_status.length!=0){ $("#tbody_crawler_onion_info").empty(); var tableRef = document.getElementById('tbody_crawler_onion_info'); - for (var i = 0; i < data.crawler_metadata_onion.length; i++) { - var crawler = data.crawler_metadata_onion[i]; + for (var i = 0; i < data.all_splash_crawler_status.length; i++) { + var crawler = data.all_splash_crawler_status[i]; var newRow = tableRef.insertRow(tableRef.rows.length); var text_color; var icon; @@ -205,41 +191,22 @@ function refresh_crawler_status(){ icon = 'times'; } - var newCell = newRow.insertCell(0); - newCell.innerHTML = " "+crawler['crawler_info']+""; - - newCell = newRow.insertCell(1); - newCell.innerHTML = ""+crawler['crawling_domain']+""; - - newCell = newRow.insertCell(2); - newCell.innerHTML = "
"+crawler['status_info']+"
"; - - //$("#panel_crawler").show(); - } - } - if(data.crawler_metadata_regular.length!=0){ - $("#tbody_crawler_regular_info").empty(); - var tableRef = document.getElementById('tbody_crawler_regular_info'); - for (var i = 0; i < data.crawler_metadata_regular.length; i++) { - var crawler = data.crawler_metadata_regular[i]; - var newRow = tableRef.insertRow(tableRef.rows.length); - var text_color; - var icon; - if(crawler['status']){ - text_color = 'Green'; - icon = 'check'; + if(crawler['type'] === 'onion'){ + icon_t = 'fas fa-user-secret'; } else { - text_color = 'Red'; - icon = 'times'; + icon_t = 'fab fa-html5'; } var newCell = newRow.insertCell(0); newCell.innerHTML = " "+crawler['crawler_info']+""; - newCell = newRow.insertCell(1); - newCell.innerHTML = ""+crawler['crawling_domain']+""; + var newCell = newRow.insertCell(1); + newCell.innerHTML = ""; newCell = newRow.insertCell(2); + newCell.innerHTML = ""+crawler['crawling_domain']+""; + + newCell = newRow.insertCell(3); newCell.innerHTML = "
"+crawler['status_info']+"
"; //$("#panel_crawler").show(); diff --git a/var/www/templates/crawler/crawler_splash/settings_edit_crawlers_to_launch.html b/var/www/templates/crawler/crawler_splash/settings_edit_crawlers_to_launch.html new file mode 100644 index 00000000..a9653820 --- /dev/null +++ b/var/www/templates/crawler/crawler_splash/settings_edit_crawlers_to_launch.html @@ -0,0 +1,60 @@ + + + + + AIL-Framework + + + + + + + + + + + + + + + {% include 'nav_bar.html' %} + +
+
+ + {% include 'crawler/menu_sidebar.html' %} + +
+ +
+
Number of Crawlers to Launch:
+ + + {%for crawler_name in nb_crawlers_to_launch%} + + + + + {%endfor%} + +
{{crawler_name}} + +
+ +
+ +
+
+
+ + + + + diff --git a/var/www/templates/crawler/crawler_splash/settings_edit_splash_crawler_manager.html b/var/www/templates/crawler/crawler_splash/settings_edit_splash_crawler_manager.html new file mode 100644 index 00000000..2eca4ba8 --- /dev/null +++ b/var/www/templates/crawler/crawler_splash/settings_edit_splash_crawler_manager.html @@ -0,0 +1,55 @@ + + + + + AIL-Framework + + + + + + + + + + + + + + + {% include 'nav_bar.html' %} + +
+
+ + {% include 'crawler/menu_sidebar.html' %} + +
+ +
+
+ + +
+
+ + +
+ +
+ +
+
+
+ + + + + diff --git a/var/www/templates/crawler/crawler_splash/settings_splash_crawler.html b/var/www/templates/crawler/crawler_splash/settings_splash_crawler.html new file mode 100644 index 00000000..8b1dae72 --- /dev/null +++ b/var/www/templates/crawler/crawler_splash/settings_splash_crawler.html @@ -0,0 +1,299 @@ + + + + + AIL-Framework + + + + + + + + + + + + + + + {% include 'nav_bar.html' %} + +
+
+ + {% include 'crawler/menu_sidebar.html' %} + +
+ +
+
+ + + +
+
+ +
+
+ + +
+
+ + {% if is_manager_connected['status'] %} +
+ + Connected +
+ {% else %} +
+ + Error +
+ {% endif %} +
+

Splash Crawler Manager

+
+
+ + {%if not is_manager_connected['status']%} + {% include 'crawler/crawler_disabled.html' %} + {%endif%} + +
+
+
+
+ + + + + + + + + + + + +
Splash Manager URL {{splash_manager_url}}
API Key + {{api_key}} + + + + + +
+
+
+
+
+ +
+ +
+
+
Number of Crawlers to Launch:
+ + + {%for crawler in nb_crawlers_to_launch%} + + + + + {%endfor%} + +
{{crawler}} {{nb_crawlers_to_launch[crawler]}}
+ + + +
+
+ +
+
+
All Splash Crawlers:
+ + + + + + + + + + {% for splash_name in all_splash %} + + + + + + + + {% endfor %} + +
+ Splash name + + Proxy + + Crawler type + + Description +
+ {{splash_name}} + + {{all_splash[splash_name]['proxy']}} + + {%if all_splash[splash_name]['type']=='tor'%} + + {%else%} + + {%endif%} + {{all_splash[splash_name]['type']}} + + {{all_splash[splash_name]['description']}} + +
+ +
+
+
+
+ +
+
+
All Proxies:
+ + + + + + + + + + + + {% for proxy_name in all_proxies %} + + + + + + + + + + {% endfor %} + +
+ Proxy name + + Host + + Port + + Type + + Crawler Type + + Description +
+ {{proxy_name}} + + {{all_proxies[proxy_name]['host']}} + + {{all_proxies[proxy_name]['port']}} + + {{all_proxies[proxy_name]['type']}} + + {%if all_proxies[proxy_name]['crawler_type']=='tor'%} + + {%else%} + + {%endif%} + {{all_proxies[proxy_name]['crawler_type']}} + + {{all_proxies[proxy_name]['description']}} + +
+ +
+
+
+
+
+
+
+ +
+
+

Crawlers Settings

+
+
+ + + + + + + + + + {% for config_field in crawler_full_config %} + + + + + + + {% endfor %} + +
+ Key + + Description + + Value +
+ {{config_field}} + + {{crawler_full_config[config_field]['info']}} + + {{crawler_full_config[config_field]['value']}} + +
+ +
+
+ +
+
+ +
+
+
+ + + + + diff --git a/var/www/templates/crawler/menu_sidebar.html b/var/www/templates/crawler/menu_sidebar.html index c14abbbe..d3ed9170 100644 --- a/var/www/templates/crawler/menu_sidebar.html +++ b/var/www/templates/crawler/menu_sidebar.html @@ -14,7 +14,7 @@