BadLinkFinder - Find bad links on your website and fix them!
A 4-hour project I wrote this afternoon, and I figured I'd share. Would love some testing.
https://github.com/Jonchun/badlinkfinder
I haven't had a chance to write up documentation yet, but usage is pretty simple. Install Python 3 and the Requests library.
Run with
python3 -m badlinkfinder --help
If anyone has any tips on how to improve it (even obvious things like adding real logging, or just code style fixes), it would be much appreciated.
Tagging @joepie91 because he'll find his way here somehow anyway, and if I'm going to get yelled at for bad code, I want it to be by him.
Also tagging @raindog308 since he liked my last one.
Comments
Just going to bump this. I did a few rewrites of parts of the code and cleaned up a lot of it so that it's in a "usable" state IMO. Here is some sample output (run without --include_inbound, as it made it too long): https://pastebin.com/raw/WZEAPmzQ
Tried it on a wordpress site, no bad links found.
Tried it on a phpBB forum; it runs "forever" (I manually quit it after 30 minutes). The forum had about 200 posts, so it's not the size.
Screaming Frog is pretty much the default tool people use for this kind of thing, in case you feel the need to compare.
Mind PMing me the site? I'd like to see where it's getting stuck. If it's private, you can also run it with
--verbosity=INFO
and it will show you exactly what is queued up and what's happening.

Running it with the normal verbosity will show output like this:
++++++++...++.+++..+.............++++...........++++++++++....+....+.....!++!+....++..+++++.+!+++.+++.+..++.++.++.++.+++..+++.+..+++.+..++.+++.+..++.+++..+++.+..++.++.++.+++.+..++.++.++.++.++.+++.+..++.++...+....++.....
where a "+" indicates a new URL being crawled, a "!" indicates an error found, and a "." indicates a URL being repeated (ignored). If it keeps spitting output, it means it is still running. This does a FULL recursive crawl of every referenced URL (including assets), so each page can easily have another 50 URLs to search. I'm trying to think of a better way to do this, but not sure if there's another way.

Awesome! Thanks. I'll take a look. (It looks like their tool is heavily limited to 500 URLs only, so an open-source alternative would be nice.)
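For what it's worth, the crawl loop described above boils down to a deduplicated breadth-first traversal. Here is a minimal sketch of that idea (not the actual badlinkfinder code; the hard-coded SITE graph stands in for real HTTP fetches, with None simulating a broken URL):

```python
from collections import deque

# Hypothetical link graph standing in for real HTTP responses;
# a None value simulates a URL that errors out (404 etc.).
SITE = {
    "/": ["/about", "/blog", "/missing"],
    "/about": ["/"],
    "/blog": ["/", "/blog/post-1"],
    "/blog/post-1": ["/blog"],
    "/missing": None,
}

def crawl(start):
    seen = set()
    queue = deque([start])
    errors = []
    progress = []
    while queue:
        url = queue.popleft()
        if url in seen:
            progress.append(".")   # repeated URL, ignored
            continue
        seen.add(url)
        links = SITE.get(url)
        if links is None:
            progress.append("!")   # error found
            errors.append(url)
            continue
        progress.append("+")       # new URL being crawled
        queue.extend(links)
    return "".join(progress), errors

out, errs = crawl("/")
print(out)   # "+++!..+."
print(errs)  # ["/missing"]
```

The seen set is what keeps the "FULL recursive crawl" from looping forever on pages that link back to each other.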
I didn't really bother searching to see if anyone else already does this, since it was part practice, part fun. Hopefully this will one day evolve into a solid alternative (I'll probably have to package it as a Windows app for it to be used by "SEO Specialists", though).
I wouldn't worry about the 'SEO' angle, but certainly people who are actively interested in marketing their website should appreciate what your tool does.
That's what I figure it will be: a cool side tool that should one day be useful. I've been looking to test it more and more, so I went and ran it on the site in your sig in case you're interested.
Damn, I never noticed the missing font. Thanks
Would like to request an "--ignore-prefix" kind of option. My site uses a lot of custom prefix:url things which obviously don't resolve. Having it as a command-line arg would remove the need to edit the source code for such stuff.
Good idea. https://github.com/Jonchun/badlinkfinder/issues
Yep cool, the site isn't really production ready.
You might want to ignore anchors like #sidebar-dashboard (strip them out of your accumulated link list), and think about the base href element and frames if you haven't got them covered already.
Yeap! I think what I'll work on next is normalizing URLs.
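As a rough illustration of what that normalization could look like (a stdlib sketch, not the actual badlinkfinder code; the exact canonicalization rules here are an assumption):

```python
from urllib.parse import urljoin, urldefrag, urlsplit, urlunsplit

def normalize(base, href):
    """Resolve a link against its page URL and canonicalize it a bit."""
    absolute = urljoin(base, href)          # resolve relative links / base href
    absolute, _frag = urldefrag(absolute)   # drop #anchors like #sidebar-dashboard
    parts = urlsplit(absolute)
    path = parts.path or "/"                # treat "" and "/" paths the same
    # Scheme and host are case-insensitive, so lowercase them.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))

print(normalize("https://Example.com/blog/", "../about#team"))
# https://example.com/about
```

Normalizing before checking the seen set means "https://Example.com/about#team" and "https://example.com/about" count as one URL instead of two.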
After URL normalization, will probably work on saving the website nodegraph to a file so that you can revisit it as necessary or stop the crawler and start it again later.
Will also work on automatically saving the output/found errors into an --output_file so that it's easier to work with.

@teamacc Added the --ignore_prefix option: https://github.com/Jonchun/badlinkfinder/commit/75394ebf1859ddc91c65663c5a792295abbf33e2
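For anyone curious how such a flag might look, here is a rough argparse sketch of the idea (not the actual commit; the option wiring and helper name are illustrative):

```python
import argparse

parser = argparse.ArgumentParser(prog="badlinkfinder")
parser.add_argument(
    "--ignore_prefix",
    action="append",            # repeatable: collect all prefixes into a list
    default=[],
    help="skip URLs starting with this prefix, e.g. --ignore_prefix steam:",
)
args = parser.parse_args(["--ignore_prefix", "steam:", "--ignore_prefix", "irc:"])

def should_crawl(url, ignored=tuple(args.ignore_prefix)):
    """Filter out custom prefix:url schemes before they hit the crawl queue."""
    return not any(url.startswith(prefix) for prefix in ignored)

print(should_crawl("steam://connect/example"))  # False: matches ignored prefix
print(should_crawl("https://example.com/"))     # True: normal URL, crawl it
```

Using action="append" lets users pass the flag multiple times instead of editing the source for each custom scheme.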
Can this be run inside of my brain?
Doesn't hurt to try.
404 all those cringe moments eh.
Yeah, I've been trying to track down @404error. He's very ... elusive.
A 'fancier' version of a bad link finder is caching the content, maybe using a bit of contextual matching/shingling and checking for content that's drastically changed.
It's useful for folks who link out to external resources, where the resource has perhaps expired and now points to something entirely different, or simply no longer has the relevant content. It's a lot less trivial than "is the HTTP code 404", but it has the same utility.
You mean to catch those pesky custom 404 pages?
Make it Python 2 compatible since that is what is installed by default on most VPSs.
No, those would be soft-404 pages (I assume that's what you meant), but sure, that's a relevant improvement. I was thinking more along the lines of parked pages or repurposed expired/auctioned domains.
Is this a good time to lurk?
//slowly backs away...
This could definitely be implemented, although I'm not 100% sure how useful it would be. You do bring up a good point about external links going bad. I will write it so external links are checked for validity but not scraped for more links.
It would be somewhat trivial to just cache all external resources, do a percent-similarity match, and trigger an alert if the content differs by more than X%.
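For reference, a crude percent-similarity match like that can be done with the stdlib before reaching for shingling (a sketch; the 0.7 threshold and the sample strings are made up for illustration):

```python
from difflib import SequenceMatcher

def content_drifted(cached, current, threshold=0.7):
    """Flag a cached external page whose content changed by more than
    (1 - threshold), e.g. an expired domain now showing a parked page."""
    ratio = SequenceMatcher(None, cached, current).ratio()
    return ratio < threshold

old = "Acme Hosting - cheap KVM VPS plans from $5/month"
new = "This domain is for sale! Buy acmehosting.com today."
print(content_drifted(old, old))  # False: identical content
print(content_drifted(old, new))  # True: looks repurposed/parked
```

SequenceMatcher is quadratic in the worst case, so for large pages you would likely compare extracted text (or hashes of shingles) rather than raw HTML.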
Most VPSs are also capable of installing Python 3. I don't believe in supporting legacy systems when the more current ones are easily obtainable. (It would be a different story if Python 3 weren't supported on a lot of systems.)
I'm using a WP plugin to find bad and broken links; it has a cron job that checks all 10,000 posts on my sites for internal/external bad and broken links every 24 hours. @Jonchun you could implement such a feature too!
Link rot itself is pretty bad, but people also don't heed the advice that Cool URIs don't change, and this even applies to domains. The main thing (for a number of reasons) is that you wouldn't want to link to something that's no longer showing the expected content. A simple example would be the bunch of web hosts who advertise here whose domains have expired.
Wikipedia probably has some useful ideas on the subject, as they're constantly seeing what circumstances the web throws up.
Fair enough. I can add caching + comparing to the backlog.
The problem with this approach is that those plugins are basically memory and CPU hogs!
Nice work @Jonchun.
edit: removed