commit 483f5ee78ba594f63dd1169372481bc1acb90daf
Author: simon
Date:   Sun Jan 31 14:39:33 2021 +0700

    initial commit, the wp-spider is alive

diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..bcba082
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,8 @@
+# cache
+__pycache__
+
+# ignore config file in use
+config
+
+# ignore csv files created
+*.csv
\ No newline at end of file
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..ec90226
--- /dev/null
+++ b/README.md
@@ -0,0 +1,85 @@
+# wp-spider
+## A spider to crawl a WordPress website.
+
+## Use case
+WordPress doesn't make it easy to keep the media library organized. On sites that change constantly, particularly when multiple users are involved, the media library can grow out of control fast. This can result in decreased performance, unnecessary disk usage, ever-growing backup files and other problems down the road.
+
+That's where **wp-spider** comes to the rescue: this Python script goes through every page on your site, compares the pictures visible on the site with the pictures in your media library and then writes the result to a set of convenient CSV files for you to analyze further with your favourite spreadsheet software.
+
+Additionally, the spider checks every link it finds and flags dead links pointing to resources that don't exist. It also reports images used on the site that are missing from your library.
+
+**Disclaimer:** Don't run this script against a site you don't own or don't have permission to scan. Traffic like that might get interpreted as malicious and could result in throttling of your connection or a ban. If you have any such measures on your own site, it might be a good idea to add your IP to the whitelist first.
+
+
+## How it works
+The script starts from the *start_url* defined in the config file, parses the HTML and looks for every link and every image, whether referenced via an *img* tag or via the *background-image* CSS property.
+It then follows every link on the same domain and parses that HTML as well. Links in the top nav and the footer only get checked once, assuming that the top nav and footer are identical on every page.
+For links outside *start_url* the script makes a HEAD request to check that the link is still valid.
+After every visible page has been scraped, the script looks at the sitemap for any pages not indexed yet.
+Then the script loops through all pages of your media library by calling the standard WordPress REST API path *https://www.example.com/wp-json/wp/v2/media* to index every picture in the library.
+As a last step, the script compares the pictures visible on the site with the pictures in the library, checks all the links and writes the result to CSV.
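+
+The image-extraction step is implemented in *src/parse_html.py* below; as a simplified, standalone sketch (not the exact code in this commit, and the HTML snippet is made up for illustration), pulling *background-image* URLs out of inline styles boils down to this:
+
+```python
+from bs4 import BeautifulSoup
+
+# hypothetical page snippet, just for illustration
+html = '<div style="background-image: url(https://www.example.com/wp-content/uploads/2021/01/hero.jpg)"></div>'
+soup = BeautifulSoup(html, 'lxml')
+
+for div in soup.find_all('div'):
+    style = div.get('style', '')
+    if 'background-image' in style:
+        # take whatever sits between the first pair of parentheses
+        print(style.split('(')[1].split(')')[0])
+```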
+
+
+## Limitations
+BeautifulSoup needs to be able to parse the page correctly. Content loaded purely via JavaScript might not be readable by BeautifulSoup.
+Pages not publicly visible will not get parsed. The same goes for pages that are never linked on your site and not listed in the sitemap.
+Pictures not in use on the website but in use anywhere else, for example in email publications, might show up as *not found*.
+
+
+## Pros and Cons
+There are other, similar solutions that try to do the same thing with a different approach, namely as a WordPress plugin. That approach is usually based on querying the WordPress database directly.
+Depending on how the site was built, that might result in an incomplete picture: the plugin in question might not consider that a site builder uses a different database table than expected and will therefore not find the picture in use.
+That's how this approach is different: if the picture is visible in the HTML it will show as *in use*, making this an implementation-agnostic approach.
+The downside, in addition to the limitations above, is that depending on the number of pages, the library size and more, it can take a long time to go through every page, particularly with a safe *request_timeout* value.
+
+
+## Installation
+Install the required non-standard Python libraries:
+**requests** to make the HTTP calls, [link](https://pypi.org/project/requests/)
+* On Arch: `sudo pacman -S python-requests`
+* Via Pip: `pip install requests`
+**bs4** aka *BeautifulSoup4* to parse the HTML, [link](https://pypi.org/project/beautifulsoup4/)
+* On Arch: `sudo pacman -S python-beautifulsoup4`
+* Via Pip: `pip install beautifulsoup4`
+**lxml**, the parser this script tells BeautifulSoup to use, [link](https://pypi.org/project/lxml/)
+* On Arch: `sudo pacman -S python-lxml`
+* Via Pip: `pip install lxml`
+
+
+## Run
+Make sure the `config` file is set up correctly, see below.
+Run the script from the *wp-spider* folder with `./spider.py`.
+After completion check the *csv* folder for the result.
+
+
+## Interpret the result
+After completion the script will have created three CSV files in the *csv* folder, stamped with the date of completion:
+1. href_list.csv: A list of every link on every page, with the following columns:
+   * **page** : The page the link was found on.
+   * **url** : Where the link points to.
+   * **local** : *True* if the link goes to a page on the same domain, *False* if it goes out to another domain.
+   * **href_status_code** : HTTP status code of the link.
+2. img_lib.csv: A list of every image discovered in your WordPress library, with the following columns:
+   * **url** : The shortened direct link to the picture; prepend *upload_folder* to get the full link.
+   * **found** : *True* if the picture has been found anywhere on the website, *False* if it has not.
+3. img_list.csv: A list of all images discovered anywhere on the website, with the following columns:
+   * **page** : The page the picture was found on.
+   * **img_short** : Shortened URL to the picture in the media library.
+   * **img_status_code** : HTTP status code of the image URL.
+
+From there it is straightforward to further analyze the result, for example by filtering for pictures not in use, links not returning a 200 HTTP response, or pictures on the site that don't exist in the library.
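+
+If you prefer a few lines of Python over a spreadsheet, a minimal sketch like the one below lists every library image the spider did not find anywhere on the site (the file name is just an example of the timestamped output described above):
+
+```python
+import csv
+
+# example file name following the domain_date_ naming scheme described above
+with open('csv/example_2021-01-31_img_lib.csv', newline='') as csvfile:
+    for row in csv.DictReader(csvfile):
+        # the spider stores the boolean as the string "True" / "False"
+        if row['found'] == 'False':
+            print(row['url'])
+```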
+
+
+## Config
+Copy or rename the file *config.sample* to *config* and make sure you set all the variables.
+The config file supports the following settings:
+* *start_url* : Fully qualified URL of the home page of the website to parse. Add *www* if your canonical website uses it, to avoid landing in a redirect for every request.
+  * example: `https://www.example.com/`
+* *sitemap_url* : Link to the sitemap, so pages not linked anywhere but indexed can get parsed too.
+  * example: `https://www.example.com/sitemap_index.xml`
+* *upload_folder* : WordPress upload folder where the media library builds its folder tree.
+  * example: `https://www.example.com/wp-content/uploads/` for a default WordPress installation.
+* *valid_img_mime* : A comma-separated list of image [MIME types](https://www.iana.org/assignments/media-types/media-types.xhtml#image) you want to consider as an image and check for existence. An easy way to exclude files like PDFs or other media files.
+  * example: `image/jpeg, image/png`
+* *top_nav_class* : The CSS class of the top nav bar, so the script doesn't have to recheck these links for every page.
+  * example: `top-nav-class`
+* *footer_class* : The CSS class of the footer, to avoid rechecking these links for every page.
+  * example: `footer-class`
+* *request_timeout* : Timeout in **seconds** between every request to avoid overloading server resources. Particularly important if the site is on shared hosting and/or a rate limiter is in place.
+  * example: `30`
diff --git a/config.sample b/config.sample
new file mode 100644
index 0000000..049f942
--- /dev/null
+++ b/config.sample
@@ -0,0 +1,8 @@
+[setup]
+start_url = https://www.example.com/
+sitemap_url = https://www.example.com/sitemap_index.xml
+upload_folder = https://www.example.com/wp-content/uploads/
+valid_img_mime = image/jpeg, image/png
+top_nav_class = top-nav-class
+footer_class = footer-class
+request_timeout = 30
\ No newline at end of file
diff --git a/spider.py b/spider.py
new file mode 100755
index 0000000..8e1245c
--- /dev/null
+++ b/spider.py
@@ -0,0 +1,153 @@
+#!/usr/bin/env python3
+""" spiderman """
+
+import configparser
+from time import sleep
+
+import requests
+from bs4 import BeautifulSoup
+
+import src.parse_html as parse_html
+import src.second_stage as second_stage
+import src.process_lists as process_lists
+import src.write_output as write_output
+
+
+def get_config():
+    """ read the config file and return a config dict """
+    # parse
+    config_parser = configparser.ConfigParser()
+    config_parser.read('config')
+    # create dict
+    config = {}
+    config["start_url"] = config_parser.get('setup', "start_url")
+    config["sitemap_url"] = config_parser.get('setup', "sitemap_url")
+    config["upload_folder"] = config_parser.get('setup', "upload_folder")
+    config["top_nav_class"] = config_parser.get('setup', "top_nav_class")
+    config["footer_class"] = config_parser.get('setup', "footer_class")
+    config["request_timeout"] = int(config_parser.get('setup', "request_timeout"))
+    mime_list = config_parser.get('setup', "valid_img_mime").split(',')
+    config["valid_img_mime"] = [mime.strip() for mime in mime_list]
+    return config
+
+
+def main():
+    """ start the whole spider process from here """
+    # get config
+    config = get_config()
+    # control progress
+    discovered = []
+    indexed = []
+    # main lists to collect results
+    main_img_list = []
+    main_href_list = []
+    # poor man's caching
+    connectivity_cache = []
+    # start with start_url
+    start_url = config['start_url']
+    discovered.append(start_url)
+    page_processing(discovered, indexed, main_img_list,
+                    main_href_list, connectivity_cache, config)
+    # add pages from the sitemap and restart
+    second_stage.discover_sitemap(config, discovered)
+    page_processing(discovered, indexed, main_img_list,
+                    main_href_list, connectivity_cache, config)
+    # read out the media library
+    img_lib_main = second_stage.get_media_lib(config)
+    # compare
+    analyzed_img_list = process_lists.img_processing(main_img_list, img_lib_main)
+    # write csv files
+    write_output.write_csv(main_img_list, main_href_list, analyzed_img_list, config)
+
+
+def page_processing(discovered, indexed, main_img_list, main_href_list, connectivity_cache, config):
+    """ start the main loop to discover new pages """
+    request_timeout = config['request_timeout']
+    for page in discovered:
+        if page not in indexed:
+            print(f'parsing [{len(indexed)}]/[{len(discovered)}] {page}')
+            img_list, href_list = parse_page(page, connectivity_cache, config)
+            for img in img_list:
+                main_img_list.append(img)
+            for href in href_list:
+                main_href_list.append(href)
+                url = href['url']
+                # add to discovered if all conditions match
+                is_local = href['local']
+                not_discovered = url not in discovered
+                not_hash_link = '#' not in url
+                not_bad_ending = url.lower().split('.')[-1] not in ['pdf', 'jpeg']
+                if is_local and not_discovered and not_hash_link and not_bad_ending:
+                    discovered.append(url)
+            # done
+            indexed.append(page)
+            # take it easy
+            sleep(request_timeout)
+
+
+def connectivity(url, connectivity_cache):
+    """ returns the http status code for a url """
+    # look if it's already in the cache
+    already_found = next((item for item in connectivity_cache if item["url"] == url), None)
+    user_agent = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
+                  "AppleWebKit/537.36 (KHTML, like Gecko) "
+                  "Chrome/70.0.3538.77 Safari/537.36")
+    headers = {'User-Agent': user_agent}
+    if not already_found:
+        try:
+            request = requests.head(url, timeout=5, headers=headers)
+            status_code = request.status_code
+            connectivity_cache.append({"url": url, "status_code": status_code})
+        except requests.exceptions.RequestException:
+            # treat failed requests as not found
+            print('failed at: ' + url)
+            status_code = 404
+    else:
+        status_code = already_found["status_code"]
+    return status_code
+
+
+def parse_page(page, connectivity_cache, config):
+    """ takes the page url and returns all img and href found """
+    request_timeout = config['request_timeout']
+    start_url = config['start_url']
+    upload_folder = config['upload_folder']
+    try:
+        response = requests.get(page)
+    except requests.exceptions.ConnectionError:
+        # wait and retry once
+        sleep(request_timeout)
+        response = requests.get(page)
+    soup = BeautifulSoup(response.text, 'lxml')
+    img_url_list = parse_html.get_images(soup, config)
+    # do a full scan on the homepage, else ignore top nav and footer
+    if page == start_url:
+        href_url_list = parse_html.get_hrefs(soup, config, home_pass=False)
+    else:
+        href_url_list = parse_html.get_hrefs(soup, config)
+    # parse imgs
+    img_list = []
+    for url in img_url_list:
+        # strip the upload folder prefix to get the short library path
+        img_short = url.replace(upload_folder, '', 1)
+        img_status_code = connectivity(url, connectivity_cache)
+        img_line_dict = {}
+        img_line_dict["page"] = page
+        img_line_dict["img_short"] = img_short
+        img_line_dict["img_status_code"] = img_status_code
+        img_list.append(img_line_dict)
+    # parse hrefs
+    href_list = []
+    for url in href_url_list:
+        href_status_code = connectivity(url, connectivity_cache)
+        local = url.startswith(start_url.rstrip('/'))
+        href_line_dict = {}
+        href_line_dict["page"] = page
+        href_line_dict["url"] = url
+        href_line_dict["local"] = local
+        href_line_dict["href_status_code"] = href_status_code
+        href_list.append(href_line_dict)
+    return img_list, href_list
+
+
+# launch from here
+if __name__ == '__main__':
+    main()
diff --git a/src/parse_html.py b/src/parse_html.py
new file mode 100644
index 0000000..601dab8
--- /dev/null
+++ b/src/parse_html.py
@@ -0,0 +1,75 @@
+""" parses and processes the html for each page """
+
+from time import sleep
+
+import requests
+
+
+def get_hrefs(soup, config, home_pass=True):
+    """ takes the soup and returns all hrefs found,
+    excludes # links and links to jpg files """
+    url_set = set()
+    # loop through the soup
+    all_links = soup.find_all("a")
+    for link in all_links:
+        try:
+            url = link["href"]
+        except KeyError:
+            continue
+        if url.startswith('http') and not url.endswith('#') and not url.lower().endswith('.jpg'):
+            url_set.add(url)
+    href_url_list = list(url_set)
+    # remove top nav and footer items if not on the homepage
+    if home_pass:
+        # split off the nav and footer containers as configured in the config file
+        soup_nav = soup.find(class_=config['top_nav_class'])
+        soup_footer = soup.find(class_=config['footer_class'])
+        # collect the nav and footer links so they can be skipped
+        try:
+            all_nav_links = list({x["href"] for x in soup_nav.find_all("a")})
+            all_footer_links = list({x["href"] for x in soup_footer.find_all("a")})
+            href_url_list = [link for link in href_url_list if link not in all_nav_links]
+            href_url_list = [link for link in href_url_list if link not in all_footer_links]
+        except (AttributeError, KeyError):
+            # nav or footer not found on this page, keep the full list
+            pass
+    href_url_list.sort()
+    return href_url_list
+
+
+def get_images(soup, config):
+    """ takes the soup and returns all images from
+    img html tags, inline css and external css files """
+    upload_folder = config['upload_folder']
+    request_timeout = config['request_timeout']
+    img_url_set = set()
+    # from img tags
+    all_imgs = soup.find_all("img")
+    for img in all_imgs:
+        url = img.get("src", '')
+        if upload_folder in url:
+            img_url_set.add(url)
+    # from inline style attributes
+    all_divs = soup.find_all('div')
+    for div in all_divs:
+        try:
+            style = div["style"]
+            if 'background-image' in style:
+                url = style.split('(')[1].split(')')[0]
+                img_url_set.add(url)
+        except KeyError:
+            continue
+    # from external css files
+    all_external_css = soup.find_all("link", {"rel": "stylesheet"})
+    for css_file in all_external_css:
+        remote_file = css_file["href"]
+        try:
+            remote_css = requests.get(remote_file).text
+        except requests.exceptions.ConnectionError:
+            # wait and retry once
+            sleep(request_timeout)
+            remote_css = requests.get(remote_file).text
+        css_rules = remote_css.split(';')
+        for rule in css_rules:
+            if upload_folder in rule:
+                url = rule.split('(')[1].split(')')[0]
+                img_url_set.add(url)
+    img_url_list = list(img_url_set)
+    img_url_list.sort()
+    return img_url_list
diff --git a/src/process_lists.py b/src/process_lists.py
new file mode 100644
index 0000000..f288603
--- /dev/null
+++ b/src/process_lists.py
@@ -0,0 +1,26 @@
+""" processing lists """
+
+
+def img_processing(main_img_list, img_lib_main):
+    """ takes the main_img_list and replaces every url with the matching url from the library """
+    # loop through every picture found on the site
+    for index, img_found in enumerate(main_img_list):
+        search_url = img_found['img_short']
+        for img_in_lib in img_lib_main:
+            for size in img_in_lib['sizes']:
+                if size == search_url:
+                    # replace the size variant with the main library url
+                    main_img_list[index]['img_short'] = img_in_lib['main']
+    # check which library images are in use
+    main_img_list_short = [img['img_short'] for img in main_img_list]
+    analyzed_img_list = []
+    # loop through all imgs in the library
+    for img_lib in img_lib_main:
+        main_url_lib = img_lib['main']
+        # check if in use anywhere on the site
+        found = main_url_lib in main_img_list_short
+        img_dict = {}
+        img_dict["url"] = main_url_lib
+        img_dict["found"] = found
+        analyzed_img_list.append(img_dict)
+    return analyzed_img_list
diff --git a/src/second_stage.py b/src/second_stage.py
new file mode 100644
index 0000000..248a54e
--- /dev/null
+++ b/src/second_stage.py
@@ -0,0 +1,86 @@
+""" collection of functions to gather additional information as a second stage """
+
+import json
+from time import sleep
+
+import requests
+from bs4 import BeautifulSoup
+
+
+def discover_sitemap(config, discovered):
+    """ adds pages indexed in the sitemap to the discovered list """
+    sitemap_url = config['sitemap_url']
+    request_timeout = config['request_timeout']
+    # get the sitemap index
+    print("look at sitemap")
+    try:
+        response = requests.get(sitemap_url)
+    except requests.exceptions.ConnectionError:
+        # wait and retry once
+        sleep(request_timeout)
+        response = requests.get(sitemap_url)
+    xml = response.text
+    soup = BeautifulSoup(xml, features="lxml")
+    sitemap_tags = soup.find_all("sitemap")
+    sitemap_list = [tag.find_next("loc").text for tag in sitemap_tags]
+    # loop through the index and fetch sitemap by sitemap
+    all_sitemap_pages = []
+    for sitemap in sitemap_list:
+        try:
+            response = requests.get(sitemap)
+        except requests.exceptions.ConnectionError:
+            # wait and retry once
+            sleep(request_timeout)
+            response = requests.get(sitemap)
+        xml = response.text
+        soup = BeautifulSoup(xml, features="lxml")
+        page_tags = soup.find_all("url")
+        page_list = [tag.find_next("loc").text for tag in page_tags]
+        # add every page to the list
+        for page in page_list:
+            all_sitemap_pages.append(page)
+    # sort for readability
+    all_sitemap_pages.sort()
+    # add to the discovered list if new
+    for page in all_sitemap_pages:
+        if page not in discovered:
+            discovered.append(page)
+
+
+def get_media_lib(config):
+    """ returns a list of dicts of media files in the library """
+    start_url = config['start_url']
+    valid_img_mime = config['valid_img_mime']
+    request_timeout = config['request_timeout']
+    upload_folder = config['upload_folder']
+    api_url = start_url.rstrip('/') + '/wp-json/wp/v2/media?per_page=100&page='
+    # first call to read the total number of pages from the headers
+    try:
+        response = requests.get(api_url + '1')
+    except requests.exceptions.ConnectionError:
+        # wait and retry once
+        sleep(request_timeout)
+        response = requests.get(api_url + '1')
+    total_pages = int(response.headers['X-WP-TotalPages'])
+    img_lib_main = []
+    # loop through all pages of the library
+    for page in range(total_pages):
+        page_nr = str(page + 1)
+        print(f'parsing page {page_nr}/{total_pages}')
+        try:
+            response = requests.get(api_url + page_nr)
+        except requests.exceptions.ConnectionError:
+            # wait and retry once
+            sleep(request_timeout)
+            response = requests.get(api_url + page_nr)
+        img_json_list = json.loads(response.text)
+        for img in img_json_list:
+            mime_type = img['mime_type']
+            if mime_type in valid_img_mime:
+                img_dict = {}
+                img_dict['main'] = img['media_details']['file']
+                all_sizes = img['media_details']['sizes']
+                sizes_list = []
+                for size in all_sizes.values():
+                    # strip the upload folder prefix to match the short paths from the spider
+                    url = size['source_url'].replace(upload_folder, '', 1)
+                    sizes_list.append(url)
+                img_dict['sizes'] = sizes_list
+                img_lib_main.append(img_dict)
+        # take it easy
+        sleep(request_timeout)
+    # return the full list at the end
+    return img_lib_main
diff --git a/src/write_output.py b/src/write_output.py
new file mode 100644
index 0000000..bc557bd
--- /dev/null
+++ b/src/write_output.py
@@ -0,0 +1,35 @@
+""" write csv output files """
+
+from time import strftime
+import csv
+import os
+
+
+def write_csv(main_img_list, main_href_list, analyzed_img_list, config):
+    """ takes the lists and writes proper csv files for further processing """
+    start_url = config['start_url']
+    timestamp = strftime('%Y-%m-%d')
+    # build the file name prefix from the domain, e.g. example_2021-01-31_
+    netloc = start_url.split('//')[-1].split('/')[0]
+    if netloc.startswith('www.'):
+        netloc = netloc[len('www.'):]
+    domain = netloc.split('.')[0]
+    filename = f'{domain}_{timestamp}_'
+    # make sure the output folder exists
+    os.makedirs('csv', exist_ok=True)
+    # write main image csv
+    with open('csv/' + filename + 'img_list.csv', 'w', newline='') as csvfile:
+        fieldnames = ['page', 'img_short', 'img_status_code']
+        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
+        # write
+        writer.writeheader()
+        for row in main_img_list:
+            writer.writerow(row)
+    # write image library csv
+    with open('csv/' + filename + 'img_lib.csv', 'w', newline='') as csvfile:
+        fieldnames = ['url', 'found']
+        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
+        # write
+        writer.writeheader()
+        for row in analyzed_img_list:
+            writer.writerow(row)
+    # write href csv
+    with open('csv/' + filename + 'href_list.csv', 'w', newline='') as csvfile:
+        fieldnames = ['page', 'url', 'local', 'href_status_code']
+        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
+        # write
+        writer.writeheader()
+        for row in main_href_list:
+            writer.writerow(row)