BadLinkFinder - Find bad links on your website and fix them!
A 4-hour project I wrote this afternoon, and I figured I'd share. Would love some testing.
https://github.com/Jonchun/badlinkfinder
I haven't had a chance to write up documentation yet, but usage is pretty simple. Install Python 3 and the Requests library.
Run with
python3 -m badlinkfinder --help
If anyone has any tips on how to improve it (even obvious things like adding real logging, or just code style fixes), it would be much appreciated.
Tagging @joepie91 because he'll find his way here somehow anyway, and if I'm going to get yelled at for bad code, I want it to be by him.
Also tagging @raindog308 since he liked my last one.
Comments
Just going to bump this. I did a few rewrites of parts of the code and cleaned up a lot of it so that it's in a "usable" state IMO. Here is some sample output (run without --include_inbound, as it made it too long): https://pastebin.com/raw/WZEAPmzQ
Tried it on a wordpress site, no bad links found.
Tried it on a phpBB forum; it runs "forever" (I manually quit it after 30 minutes). The forum had about 200 posts, so it's not the size.
Screaming Frog is pretty much the default tool people use for this kind of thing, in case you feel the need to compare.
Mind PMing me the site? I'd like to see where it's getting stuck. If it's private, you can also run it with
--verbosity=INFO
and it will show you exactly what is queued up and what's happening.

Running it with the normal verbosity will show output like this:
++++++++...++.+++..+.............++++...........++++++++++....+....+.....!++!+....++..+++++.+!+++.+++.+..++.++.++.++.+++..+++.+..+++.+..++.+++.+..++.+++..+++.+..++.++.++.+++.+..++.++.++.++.++.+++.+..++.++...+....++.....
where a "+" indicates a new URL being crawled, a "!" indicates an error found, and a "." indicates a URL being repeated (ignored). If it keeps spitting output, it means it is still running. This does a FULL recursive crawl of every referenced URL (including assets), so each page can easily have another 50 URLs to search. I'm trying to think of a better way to do this, but not sure if there's another way.

Awesome! Thanks. I'll take a look. (It looks like their tool is heavily limited to 500 URLs only, so an open-source alternative would be nice.)
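For what it's worth, the crawl loop described above boils down to a deduplicated breadth-first traversal. Here is a minimal sketch of that idea (not the actual badlinkfinder code; the hard-coded SITE graph stands in for real HTTP fetches, with None simulating a broken URL):

```python
from collections import deque

# Hypothetical link graph standing in for real HTTP responses;
# a None value simulates a URL that errors out (404 etc.).
SITE = {
    "/": ["/about", "/blog", "/missing"],
    "/about": ["/"],
    "/blog": ["/", "/blog/post-1"],
    "/blog/post-1": ["/blog"],
    "/missing": None,
}

def crawl(start):
    seen = set()
    queue = deque([start])
    errors = []
    progress = []
    while queue:
        url = queue.popleft()
        if url in seen:
            progress.append(".")   # repeated URL, ignored
            continue
        seen.add(url)
        links = SITE.get(url)
        if links is None:
            progress.append("!")   # error found
            errors.append(url)
            continue
        progress.append("+")       # new URL being crawled
        queue.extend(links)
    return "".join(progress), errors

out, errs = crawl("/")
print(out)   # "+++!..+."
print(errs)  # ["/missing"]
```

The seen set is what keeps the "FULL recursive crawl" from looping forever on pages that link back to each other.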
I didn't really bother searching to see if anyone else already does this, since it was part practice, part fun. Hopefully this will one day evolve into a solid alternative (I'll probably have to package it as a Windows app for it to be used by "SEO Specialists", though).
I wouldn't worry about the 'SEO' angle, but certainly people who are actively interested in marketing their website should appreciate what your tool does.
That's what I figure it will be: a cool side tool that should one day be useful. I've been looking to test it more and more, so I went and ran it on the site in your sig in case you're interested.
Damn, I never noticed the missing font. Thanks
Would like to request an "--ignore-prefix" kind of option. My site uses a lot of custom prefix:url things which obviously don't resolve. Having it as a command-line arg would remove the need to edit the source code for such stuff.
Good idea. https://github.com/Jonchun/badlinkfinder/issues
Yep cool, the site isn't really production ready.
You might want to ignore anchors like #sidebar-dashboard (strip them out of your accumulated link list), and think about the base href element and frames if you haven't got them covered already.
Yeap! I think what I'll work on next is normalizing URLs.
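As a rough illustration of what that normalization could look like (a stdlib sketch, not the actual badlinkfinder code; the exact canonicalization rules here are an assumption):

```python
from urllib.parse import urljoin, urldefrag, urlsplit, urlunsplit

def normalize(base, href):
    """Resolve a link against its page URL and canonicalize it a bit."""
    absolute = urljoin(base, href)          # resolve relative links / base href
    absolute, _frag = urldefrag(absolute)   # drop #anchors like #sidebar-dashboard
    parts = urlsplit(absolute)
    path = parts.path or "/"                # treat "" and "/" paths the same
    # Scheme and host are case-insensitive, so lowercase them.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))

print(normalize("https://Example.com/blog/", "../about#team"))
# https://example.com/about
```

Normalizing before checking the seen set means "https://Example.com/about#team" and "https://example.com/about" count as one URL instead of two.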
After URL normalization, will probably work on saving the website nodegraph to a file so that you can revisit it as necessary or stop the crawler and start it again later.
Will also work on automatically saving the output/found errors into an --output_file so that it's easier to work with.

@teamacc Added the --ignore_prefix option: https://github.com/Jonchun/badlinkfinder/commit/75394ebf1859ddc91c65663c5a792295abbf33e2
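For anyone curious how such a flag might look, here is a rough argparse sketch of the idea (not the actual commit; the option wiring and helper name are illustrative):

```python
import argparse

parser = argparse.ArgumentParser(prog="badlinkfinder")
parser.add_argument(
    "--ignore_prefix",
    action="append",            # repeatable: collect all prefixes into a list
    default=[],
    help="skip URLs starting with this prefix, e.g. --ignore_prefix steam:",
)
args = parser.parse_args(["--ignore_prefix", "steam:", "--ignore_prefix", "irc:"])

def should_crawl(url, ignored=tuple(args.ignore_prefix)):
    """Filter out custom prefix:url schemes before they hit the crawl queue."""
    return not any(url.startswith(prefix) for prefix in ignored)

print(should_crawl("steam://connect/example"))  # False: matches ignored prefix
print(should_crawl("https://example.com/"))     # True: normal URL, crawl it
```

Using action="append" lets users pass the flag multiple times instead of editing the source for each custom scheme.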
Can this be run inside of my brain?
Doesn't hurt to try.
404 all those cringe moments eh.
Yeah, I've been trying to track down @404error. He's very ... elusive.
A 'fancier' version of a bad link finder is caching the content, maybe using a bit of contextual matching/shingling and checking for content that's drastically changed.
It's useful for folks who link out to external resources, where the resource has perhaps expired and now points to something entirely different, or simply no longer has the relevant content. It's a lot less trivial than "is the HTTP code 404", but it has the same utility.
You mean to catch those pesky custom 404 pages?
Make it Python 2 compatible since that is what is installed by default on most VPSs.
No, those would be soft-404 pages (I assume that's what you meant), but sure, that's a relevant improvement. I was thinking more along the lines of parked pages or repurposed expired/auctioned domains.
Is this a good time to lurk?
//slowly backs away...
This could definitely be implemented, although I'm not 100% sure how useful it would be. You do bring up a good point about external links going bad. I will write it so external links are checked for validity but not scraped for more links.
It would be somewhat trivial to just cache all external resources, do a percent-similarity match, and trigger an alert if the content differs by more than X%.
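For reference, a crude percent-similarity match like that can be done with the stdlib before reaching for shingling (a sketch; the 0.7 threshold and the sample strings are made up for illustration):

```python
from difflib import SequenceMatcher

def content_drifted(cached, current, threshold=0.7):
    """Flag a cached external page whose content changed by more than
    (1 - threshold), e.g. an expired domain now showing a parked page."""
    ratio = SequenceMatcher(None, cached, current).ratio()
    return ratio < threshold

old = "Acme Hosting - cheap KVM VPS plans from $5/month"
new = "This domain is for sale! Buy acmehosting.com today."
print(content_drifted(old, old))  # False: identical content
print(content_drifted(old, new))  # True: looks repurposed/parked
```

SequenceMatcher is quadratic in the worst case, so for large pages you would likely compare extracted text (or hashes of shingles) rather than raw HTML.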
Most VPSs are also capable of installing Python 3. I don't believe in supporting legacy systems when the more current ones are easily obtainable. (It would be a different story if Python 3 weren't supported on a lot of systems.)
I'm using a WP plugin to find bad and broken links; it has a cron job that checks all 10,000 posts on my sites for internal/external bad and broken links every 24 hours. @Jonchun you could implement such a feature too!
Link rot itself is pretty bad, but people also don't heed the advice that Cool URIs don't change, and this even applies to domains. The main thing (for a number of reasons) is that you wouldn't want to link to something that's no longer showing the expected content. A simple example would be the bunch of web hosts who advertise here whose domains have expired.
Wikipedia probably has some useful ideas on the subject, as they're constantly seeing what circumstances the web throws up.
Fair enough. I can add caching + comparing to the backlog.
The problem with this approach is that those plugins are basically memory and CPU hogs!
Nice work @Jonchun.
edit: removed