initial commit, the wp-spider is alive

simon 2021-01-31 14:39:33 +07:00
commit 483f5ee78b
8 changed files with 476 additions and 0 deletions

.gitignore vendored Normal file

@@ -0,0 +1,8 @@
# cache
__pycache__
# ignore config file in use
config
# ignore csv files created
*.csv

README.md Normal file

@@ -0,0 +1,85 @@
# wp-spider
## A spider to go through a WordPress website.
## Use case
WordPress doesn't make it easy to keep the media library organized. On sites that change constantly, particularly when multiple users are involved, the media library can quickly grow out of control. This can result in decreased performance, unnecessary disk usage, ever-growing backup files and other potential problems down the road.
That's where **wp-spider** comes to the rescue: this Python script will go through every page on your site, compare the pictures visible on the site with the pictures in your media library, and output the result into a set of convenient CSV files for you to analyze further with your favourite spreadsheet software.
Additionally, the spider will check every link it finds for dead targets, i.e. links going to resources that don't exist, and it will flag any images used on the site that are missing from your library.
**Disclaimer:** Don't run this script against a site you don't own or don't have permission to scan. Traffic like that might get interpreted as malicious and could result in throttling of your connection or a ban. If you have measures like that on your own site, it might be a good idea to add your IP to the whitelist.
## How it works
The script starts from the *start_url* defined in the config file, parses the HTML and looks for every link and every image, whether referenced via an `<img>` tag or via the *background-image* CSS property.
The script then follows every link on the same domain and parses that HTML as well. Links in the top nav and the footer only get checked once, assuming they are identical on every page.
For links outside *start_url* the script makes a HEAD request to check that the link is still valid.
After every visible page has been scraped, the script looks at the sitemap for any pages not indexed yet.
Next, the script loops through all pages of your media library by calling the standard WordPress REST API endpoint *https://www.example.com/wp-json/wp/v2/media* to index every picture in the library.
As a last step, the script compares the pictures visible on the site with the pictures in the library, checks all the links and writes the results to CSV files.
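To illustrate the first step, here is a minimal sketch of the per-page extraction, simplified from *src/parse_html.py* (the URLs are placeholders):

```python
import requests
from bs4 import BeautifulSoup

# placeholder values, the real script reads these from the config file
page = "https://www.example.com/"
upload_folder = "https://www.example.com/wp-content/uploads/"

soup = BeautifulSoup(requests.get(page).text, "lxml")

# pictures referenced via <img> tags that live in the upload folder
images = {img["src"] for img in soup.find_all("img", src=True)
          if upload_folder in img["src"]}

# pictures referenced via an inline background-image CSS rule
for div in soup.find_all("div", style=True):
    style = div["style"]
    if "background-image" in style and "(" in style:
        images.add(style.split("(")[1].split(")")[0])

# every absolute link found on the page
links = {a["href"] for a in soup.find_all("a", href=True)
         if a["href"].startswith("http")}
```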
## Limitations
BeautifulSoup needs to be able to parse the page correctly. Content loaded purely via JavaScript might not be readable by BeautifulSoup.
Pages not publicly visible will not get parsed. The same goes for pages that are never linked on your site and are not listed in the sitemap.
Pictures not used on the website but used anywhere else, for example in email publications, will show up as *not found*.
## Pros and Cons
There are other solutions that try to do the same thing with a different approach, namely as a WordPress plugin. That approach is usually based on querying the WordPress database directly. Depending on how the site was built, this can result in an incomplete picture: the plugin might not account for a site builder using a different database table than expected and will therefore not find a picture that is actually in use.
That's where this approach is different: if a picture is visible in the HTML, it will show up as *in use*, which makes the check implementation-agnostic.
The downside, in addition to the limitations above, is that depending on the number of pages, the size of the library and other factors, it can take a long time to go through every page, particularly with a conservative *request_timeout* value.
## Installation
Install the required non-standard Python libraries:
**requests** to make the HTTP calls, [link](https://pypi.org/project/requests/)
* On Arch: `sudo pacman -S python-requests`
* Via Pip: `pip install requests`
**bs4** aka *BeautifulSoup4* to parse the HTML, [link](https://pypi.org/project/beautifulsoup4/)
* On Arch: `sudo pacman -S python-beautifulsoup4`
* Via Pip: `pip install beautifulsoup4`
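**lxml** as the parser backend this script passes to BeautifulSoup, [link](https://pypi.org/project/lxml/)
* On Arch: `sudo pacman -S python-lxml`
* Via Pip: `pip install lxml`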
## Run
Make sure the `config` file is set up correctly, see below.
Run the script from the *wp-spider* folder with `./spider.py`.
After completion check the *csv* folder for the result.
## Interpret the result
After completion the script will have created three CSV files in the *csv* folder, timestamped with the date of completion:
1. href_list.csv: A list of every link on every page with the following columns:
* **page** : The page the link was found on.
* **url** : Where the link is going to.
* **local** : *True* if the link goes to a page on the same domain, *False* if it goes out to another domain.
* **href_status_code** : HTTP status code of the link.
2. img_lib.csv: A list of every image discovered in your WordPress library with the following columns:
* **url** : The shortened direct link to the picture, prepend *upload_folder* to get the full link.
* **found** : *True* if the picture has been found anywhere on the website, *False* if the picture has not been found anywhere.
3. img_list.csv: A list of all images discovered anywhere on the website, with the following columns:
* **page** : The page the picture was found on.
* **img_short** : Shortened URL to the picture in the media library.
* **img_status_code** : HTTP status code of the image URL.
From there it is straightforward to analyze the results further: filter for pictures not in use, links not returning a 200 HTTP response, pictures on the site that don't exist in the library, and so on.
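As a small example, the unused pictures can be pulled straight out of *img_lib.csv* with a few lines of Python (the filename below is a placeholder, the script prefixes it with your domain and the date):

```python
import csv

# placeholder filename, adjust to the file the script actually wrote
with open("csv/example_2021-01-31_img_lib.csv", newline="") as csvfile:
    unused = [row["url"] for row in csv.DictReader(csvfile)
              if row["found"] == "False"]

print(f"{len(unused)} pictures in the library are not used anywhere on the site")
```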
## Config
Copy or rename the file *config.sample* to *config* and make sure you set all the variables.
The config file supports the following settings:
* *start_url* : Fully qualified URL of the home page of the website to parse. Add *www* if your canonical website uses it, to avoid a redirect on every request.
* example: `https://www.example.com/`
* *sitemap_url* : Link to the sitemap, so pages not linked anywhere but indexed can get parsed too.
* example: `https://www.example.com/sitemap_index.xml`
* *upload_folder* : WordPress upload folder where the media library builds its folder tree.
* example: `https://www.example.com/wp-content/uploads/` for a default WordPress installation.
* *valid_img_mime* : A comma-separated list of image [MIME types](https://www.iana.org/assignments/media-types/media-types.xhtml#image) you want to consider as an image and check for its existence. An easy way to exclude files like PDFs or other media files.
* example: `image/jpeg, image/png`
* *top_nav_class* : The CSS class of the top nav bar so the script doesn't have to recheck these links again for every page.
* example: `top-nav-class`
* *footer_class* : The CSS class of the footer, to avoid rechecking these links for every page.
* example: `footer-class`
* *request_timeout* : Delay in **seconds** between consecutive requests, to avoid overloading server resources. Particularly important if the site is hosted on shared hosting and / or a rate limiter is in place.
* example: `30`

config.sample Normal file

@@ -0,0 +1,8 @@
[setup]
start_url = https://www.example.com/
sitemap_url = https://www.example.com/sitemap_index.xml
upload_folder = https://www.example.com/wp-content/uploads/
valid_img_mime = image/jpeg, image/png
top_nav_class = top-nav-class
footer_class = footer-class
request_timeout = 30

spider.py Executable file

@@ -0,0 +1,153 @@
#!/usr/bin/env python3
""" spiderman """
import configparser
from time import sleep
import requests
from bs4 import BeautifulSoup
import src.parse_html as parse_html
import src.second_stage as second_stage
import src.process_lists as process_lists
import src.write_output as write_output
def get_config():
""" read out the config file and return config dict """
# parse
config_parser = configparser.ConfigParser()
config_parser.read('config')
# create dict
config = {}
config["start_url"] = config_parser.get('setup', "start_url")
config["sitemap_url"] = config_parser.get('setup', "sitemap_url")
config["upload_folder"] = config_parser.get('setup', "upload_folder")
config["top_nav_class"] = config_parser.get('setup', "top_nav_class")
config["footer_class"] = config_parser.get('setup', "footer_class")
config["request_timeout"] = int(config_parser.get('setup', "request_timeout"))
mime_list = config_parser.get('setup', "valid_img_mime").split(',')
config["valid_img_mime"] = [mime.strip() for mime in mime_list]
return config
def main():
""" start the whole spider process from here """
# get config
config = get_config()
    # track progress
discovered = []
indexed = []
# main lists to collect results
main_img_list = []
main_href_list = []
# poor man's caching
connectivity_cache = []
# start with start_url
start_url = config['start_url']
discovered.append(start_url)
page_processing(discovered, indexed, main_img_list,
main_href_list, connectivity_cache, config)
# add from sitemap and restart
second_stage.discover_sitemap(config, discovered)
page_processing(discovered, indexed, main_img_list,
main_href_list, connectivity_cache, config)
# read out library
img_lib_main = second_stage.get_media_lib(config)
# compare
analyzed_img_list = process_lists.img_processing(main_img_list, img_lib_main)
# write csv files
write_output.write_csv(main_img_list, main_href_list, analyzed_img_list, config)
def page_processing(discovered, indexed, main_img_list, main_href_list, connectivity_cache, config):
""" start the main loop to discover new pages """
request_timeout = config['request_timeout']
for page in discovered:
if page not in indexed:
print(f'parsing [{len(indexed)}]/[{len(discovered)}] {page}')
img_list, href_list = parse_page(page, connectivity_cache, config)
for img in img_list:
main_img_list.append(img)
for href in href_list:
main_href_list.append(href)
url = href['url']
                # add to discovered if all checks match
is_local = href['local']
not_discovered = url not in discovered
not_hash_link = '#' not in url
not_bad_ending = url.lower().split('.')[-1] not in ['pdf', 'jpeg']
if is_local and not_discovered and not_hash_link and not_bad_ending:
discovered.append(url)
# done
indexed.append(page)
# take it easy
sleep(request_timeout)
def connectivity(url, connectivity_cache):
""" returns html status code from url """
# look if its already in the list
already_found = next((item for item in connectivity_cache if item["url"] == url), None)
user_agent = ( "Mozilla/5.0 (Windows NT 10.0; Win64; x64), "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/70.0.3538.77 Safari/537.36" )
headers = { 'User-Agent': user_agent }
if not already_found:
try:
request = requests.head(url, timeout=5, headers=headers)
status_code = request.status_code
connectivity_cache.append({"url": url, "status_code": status_code})
except requests.exceptions.RequestException:
print('failed at: ' + url)
status_code = 404
else:
status_code = already_found["status_code"]
return status_code
def parse_page(page, connectivity_cache, config):
""" takes the page url and returns all img and href """
request_timeout = config['request_timeout']
start_url = config['start_url']
upload_folder = config['upload_folder']
    try:
        response = requests.get(page)
    except requests.exceptions.ConnectionError:
        # wait and retry once if the connection failed
        sleep(request_timeout)
        response = requests.get(page)
    soup = BeautifulSoup(response.text, 'lxml')
img_url_list = parse_html.get_images(soup, config)
# do full scan on homepage, else ignore topnav and footer
if page == start_url:
href_url_list = parse_html.get_hrefs(soup, home_pass=False)
else:
href_url_list = parse_html.get_hrefs(soup)
# parse imgs
img_list = []
for url in img_url_list:
        # cut the upload folder prefix to get the short image path
        img_short = url.replace(upload_folder, '', 1)
img_status_code = connectivity(url, connectivity_cache)
img_line_dict = {}
img_line_dict["page"] = page
img_line_dict["img_short"] = img_short
img_line_dict["img_status_code"] = img_status_code
img_list.append(img_line_dict)
# parse hrefs
href_list = []
for url in href_url_list:
href_status_code = connectivity(url, connectivity_cache)
        local = url.startswith(start_url.rstrip('/'))
href_line_dict = {}
href_line_dict["page"] = page
href_line_dict["url"] = url
href_line_dict["local"] = local
href_line_dict["href_status_code"] = href_status_code
href_list.append(href_line_dict)
return img_list, href_list
# launch from here
if __name__ == '__main__':
main()

src/parse_html.py Normal file

@@ -0,0 +1,75 @@
""" parses and processes the html for each page """
from time import sleep
import requests
def get_hrefs(soup, home_pass=True):
""" takes the soup and returns all href found
excludes # links and links to jpg files """
url_set = set()
# loop through soup
all_links = soup.find_all("a")
for link in all_links:
try:
url = link["href"]
except KeyError:
continue
if url.startswith('http') and not url.endswith('#') and not url.lower().endswith('.jpg'):
url_set.add(url)
href_url_list = list(url_set)
    # remove top nav and footer items if this is not the homepage pass
    if home_pass:
        # split the soup into the nav and footer sections
        soup_nav = soup.find("nav", {"role": "navigation"})
        soup_footer = soup.find("div", {"class": "elementor-location-footer"})
        # collect the nav and footer links and filter them out of the result
        try:
            all_nav_links = list({x["href"] for x in soup_nav.find_all("a")})
            all_footer_links = list({x["href"] for x in soup_footer.find_all("a")})
            href_url_list = [link for link in href_url_list if link not in all_nav_links]
            href_url_list = [link for link in href_url_list if link not in all_footer_links]
        except (AttributeError, KeyError):
            # nav or footer not found on this page, keep all links
            pass
href_url_list.sort()
return href_url_list
def get_images(soup, config):
""" takes the soup and returns all images from
img html tags, inline css and external CSS files """
upload_folder = config['upload_folder']
request_timeout = config['request_timeout']
img_url_set = set()
# from img tag
all_imgs = soup.find_all("img")
for img in all_imgs:
url = img["src"]
if upload_folder in url:
img_url_set.add(url)
# from inline style tag
all_divs = soup.find_all('div')
    for div in all_divs:
        try:
            style = div["style"]
            if 'background-image' in style:
                url = style.split('(')[1].split(')')[0]
                img_url_set.add(url)
        except (KeyError, IndexError):
            # div has no inline style or the rule has no url() part
            continue
    # from external CSS files
    all_external_css = soup.find_all("link", {"rel": "stylesheet"})
    for css_file in all_external_css:
        # fetch the stylesheet currently being looped over
        remote_file = css_file["href"]
        try:
            remote_css = requests.get(remote_file).text
        except requests.exceptions.ConnectionError:
            sleep(request_timeout)
            remote_css = requests.get(remote_file).text
        css_rules = remote_css.split(';')
        for rule in css_rules:
            if upload_folder in rule:
                url = rule.split('(')[1].split(')')[0]
                img_url_set.add(url)
img_url_list = list(img_url_set)
img_url_list.sort()
return img_url_list

src/process_lists.py Normal file

@@ -0,0 +1,26 @@
""" processing lists """
def img_processing(main_img_list, img_lib_main):
""" takes the main_img_list and replace url with url from library """
# loop through every picture
    for index, img_found in enumerate(main_img_list):
        search_url = img_found['img_short']
for img_in_lib in img_lib_main:
for size in img_in_lib['sizes']:
if size == search_url:
found_url = img_in_lib['main']
main_img_list[index]['img_short'] = found_url
# check if img is used
main_img_list_short = [img['img_short'] for img in main_img_list]
analyzed_img_list = []
# loop through all imgs in library
for img_lib in img_lib_main:
main_url_lib = img_lib['main']
# check if in use
        found = main_url_lib in main_img_list_short
img_dict = {}
img_dict["url"] = main_url_lib
img_dict["found"] = found
analyzed_img_list.append(img_dict)
return analyzed_img_list

src/second_stage.py Normal file

@@ -0,0 +1,86 @@
""" collection of functions to gather additional information as a second stage """
import json
from time import sleep
import requests
from bs4 import BeautifulSoup
def discover_sitemap(config, discovered):
""" returns a list of pages indexed in the sitemap """
sitemap_url = config['sitemap_url']
request_timeout = config['request_timeout']
# get main
print("look at sitemap")
    try:
        response = requests.get(sitemap_url)
    except requests.exceptions.ConnectionError:
        sleep(request_timeout)
        response = requests.get(sitemap_url)
xml = response.text
soup = BeautifulSoup(xml, features="lxml")
sitemap_tags = soup.find_all("sitemap")
    sitemap_list = [sitemap.find_next("loc").text for sitemap in sitemap_tags]
# loop through all list and get map by map
all_sitemap_pages = []
for sitemap in sitemap_list:
        try:
            response = requests.get(sitemap)
        except requests.exceptions.ConnectionError:
            sleep(request_timeout)
            response = requests.get(sitemap)
xml = response.text
soup = BeautifulSoup(xml, features="lxml")
page_tags = soup.find_all("url")
        page_list = [page_tag.find_next("loc").text for page_tag in page_tags]
# add every page to list
for page in page_list:
all_sitemap_pages.append(page)
    # sort, then add any new pages to the discovered list in place
    all_sitemap_pages.sort()
    for page in all_sitemap_pages:
        if page not in discovered:
            discovered.append(page)
def get_media_lib(config):
""" returns a list of dics of media files in library """
# first call
start_url = config['start_url']
valid_img_mime = config['valid_img_mime']
request_timeout = config['request_timeout']
upload_folder = config['upload_folder']
    try:
        # strip a trailing slash from start_url to avoid a double slash in the endpoint
        response = requests.get(start_url.rstrip('/') + '/wp-json/wp/v2/media?per_page=100&page=1')
    except requests.exceptions.ConnectionError:
        sleep(request_timeout)
        response = requests.get(start_url.rstrip('/') + '/wp-json/wp/v2/media?per_page=100&page=1')
total_pages = int(response.headers['X-WP-TotalPages'])
img_lib_main = []
# loop through pages
for page in range(total_pages):
page_nr = str(page + 1)
print(f'parsing page {page_nr}/{total_pages}')
        try:
            response = requests.get(start_url.rstrip('/') + '/wp-json/wp/v2/media?per_page=100&page=' + page_nr)
        except requests.exceptions.ConnectionError:
            sleep(request_timeout)
            response = requests.get(start_url.rstrip('/') + '/wp-json/wp/v2/media?per_page=100&page=' + page_nr)
img_json_list = json.loads(response.text)
for img in img_json_list:
mime_type = img['mime_type']
if mime_type in valid_img_mime:
img_dict = {}
img_dict['main'] = img['media_details']['file']
all_sizes = img['media_details']['sizes']
sizes_list = []
for size in all_sizes.values():
                    # cut the upload folder prefix to get the short image path
                    url = size['source_url'].replace(upload_folder, '', 1)
sizes_list.append(url)
img_dict['sizes'] = sizes_list
img_lib_main.append(img_dict)
# take it easy
sleep(request_timeout)
# return list at end
return img_lib_main

src/write_output.py Normal file

@@ -0,0 +1,35 @@
""" write csv output files """
from time import strftime
import csv
def write_csv(main_img_list, main_href_list, analyzed_img_list, config):
""" takes the list and writes proper csv files for further processing """
start_url = config['start_url']
timestamp = strftime('%Y-%m-%d')
    # derive the bare domain name from start_url for the filename prefix
    host = start_url.split('//')[-1].split('/')[0]
    if host.startswith('www.'):
        host = host[len('www.'):]
    domain = host.split('.')[0]
filename = f'{domain}_{timestamp}_'
# write main image csv
with open('csv/' + filename + 'img_list.csv', 'w', newline='') as csvfile:
fieldnames = ['page', 'img_short', 'img_status_code']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
# write
writer.writeheader()
for row in main_img_list:
writer.writerow(row)
# write image library csv
with open('csv/' + filename + 'img_lib.csv', 'w', newline='') as csvfile:
fieldnames = ['url', 'found']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
# write
writer.writeheader()
for row in analyzed_img_list:
writer.writerow(row)
# write href csv
with open('csv/' + filename + 'href_list.csv', 'w', newline='') as csvfile:
fieldnames = ['page', 'url', 'local', 'href_status_code']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
# write
writer.writeheader()
for row in main_href_list:
writer.writerow(row)