BadLinkFinder - Find bad links on your website and fix them!

Jonchun Member
edited April 2018 in General

A 4-hour project I wrote this afternoon, and I figured I'd share. Would love some testing :)

https://github.com/Jonchun/badlinkfinder

I haven't had a chance to write up documentation yet, but usage is pretty simple. Install Python3 and the Requests library.

Run with
python3 -m badlinkfinder --help

If anyone has any tips on how to improve it (even obvious things like adding real logging or code style fixes), it would be much appreciated.

Tagging @joepie91 because he'll find his way here somehow anyways, and if I want to get yelled at for bad code, I want it to be from him :D

Also tagging @raindog308 since he liked my last one.

Comments

  • Jonchun Member
    edited April 2018

    Just going to bump this. I did a few rewrites of parts of code and cleaned up a lot of it so that it's in a "usable" state IMO. Here is some sample output:

    • @doghouch
    • @eva2000 (I didn't include --include_inbound as it made the output too long)

    https://pastebin.com/raw/WZEAPmzQ

    Thanked by: doghouch
  • Tried it on a WordPress site; no bad links found.

    Tried it on a phpBB forum; it runs "forever" (I manually quit it after 30 minutes). The forum has only about 200 posts, so it's not the size.

  • Screaming Frog is pretty much the default tool people use for this kind of thing, in case you feel the need to compare.

  • Jonchun Member
    edited April 2018

    @teamacc said:
    Tried it on a WordPress site; no bad links found.

    Tried it on a phpBB forum; it runs "forever" (I manually quit it after 30 minutes). The forum has only about 200 posts, so it's not the size.

    Mind PMing me the site? I'd like to see where it's getting stuck. If it's private, you can also run it with --verbosity=INFO and it will show you exactly what is queued up and what's happening.

    Running it with the normal verbosity will show output like this:
    ++++++++...++.+++..+.............++++...........++++++++++....+....+.....!++!+....++..+++++.+!+++.+++.+..++.++.++.++.+++..+++.+..+++.+..++.+++.+..++.+++..+++.+..++.++.++.+++.+..++.++.++.++.++.+++.+..++.++...+....++.....

    where a + indicates a new URL being crawled, a ! indicates an error found, and a . indicates a URL being repeated (ignored). If it keeps spitting output, it means it is still running. This does a FULL recursive crawl of every referenced URL (including assets), so each page can easily have another 50 URLs to search. I'm trying to think of a better way to do this, but not sure if there's another way.
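
    In code terms, the marker printing is basically just this (a rough sketch of the idea, not the exact code from the repo):

      import sys

      def print_progress(status):
          # '+' = new URL crawled, '!' = error found, '.' = already-seen URL skipped
          marker = {'new': '+', 'error': '!', 'seen': '.'}[status]
          sys.stdout.write(marker)
          sys.stdout.flush()  # flush so progress shows up immediately instead of buffering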

    @ricardo said:
    Screaming Frog is pretty much the default tool people use for this kind of thing, in case you feel the need to compare.

    Awesome! Thanks. I'll take a look. (It looks like the free version of their tool is limited to 500 URLs, so an open-source alternative would be nice.)

    I didn't really bother searching to see if anyone else already does this, since it was part practice, part fun. Hopefully this will one day evolve into a solid alternative (I'll probably have to package it as a Windows app for "SEO Specialists" to use it, though ;))

  • I wouldn't worry about the 'SEO' angle, but certainly people who are actively interested in marketing their website should appreciate what your tool does.

    Thanked by: Jonchun
  • @ricardo said:
    I wouldn't worry about the 'SEO' angle, but certainly people who are actively interested in marketing their website should appreciate what your tool does.

    That's what I figure it will be. A cool side tool that should one day be useful. I've been looking for more things to test it on, so I went and did the site in your sig in case you're interested.

    Here are the issues that were found:
    <SiteError 404 (https://indicina.it/ctl.php?id=229&did=18)>
      Referenced From:
        - https://indicina.it/hosting/provider/buyshared.net.html
        - https://indicina.it/hosting/provider/buyshared.net.html#sidebar-dashboard
    <SiteError 404 (https://indicina.it/ctl.php?item=package&id=1965&did=18)>
      Referenced From:
        - https://indicina.it/hosting/provider/integralhost.net.html#sidebar-dashboard
        - https://indicina.it/hosting/provider/integralhost.net.html
    <SiteError 404 (https://indicina.it/ctl.php?id=1965&did=18)>
      Referenced From:
        - https://indicina.it/hosting/provider/integralhost.net.html#sidebar-dashboard
        - https://indicina.it/hosting/provider/integralhost.net.html
    <SiteError 404 (https://indicina.it/hosting/ss/ipxcore.com.jpg)>
      Referenced From:
        - https://indicina.it/hosting/provider/ipxcore.com.html
        - https://indicina.it/hosting/provider/ipxcore.com.html#sidebar-dashboard
    <SiteError 404 (https://indicina.it/hosting/ss/mrvm.net.jpg)>
      Referenced From:
        - https://indicina.it/hosting/provider/mrvm.net.html#sidebar-dashboard
        - https://indicina.it/hosting/provider/mrvm.net.html
    
  • @Jonchun said:
    Just going to bump this. I did a few rewrites of parts of code and cleaned up a lot of it so that it's in a "usable" state IMO. Here is some sample output:

    Damn, I never noticed the missing font. Thanks <3

    Thanked by: netomx
  • Would like to request a "--ignore-prefix" kind of option. My site uses a lot of custom prefix:url things which obviously don't resolve. Having it as a commandline arg would remove the need to edit the source code for such stuff.

  • Jonchun Member
    edited April 2018

    @teamacc said:
    Would like to request a "--ignore-prefix" kind of option. My site uses a lot of custom prefix:url things which obviously don't resolve. Having it as a commandline arg would remove the need to edit the source code for such stuff.

    Good idea. https://github.com/Jonchun/badlinkfinder/issues

  • Jonchun said: so I went and did the site in your sig in case you're interested.

    Yep cool, the site isn't really production ready.

    You might want to ignore anchors like #sidebar-dashboard (strip them out from your accumulated link list), think about the base href element and frames if you haven't got it covered already.

  • Jonchun Member
    edited April 2018

    @ricardo said:
    You might want to ignore anchors like #sidebar-dashboard (strip them out from your accumulated link list), think about the base href element and frames if you haven't got it covered already.

    Yep! I think what I'll work on next is normalizing URLs.
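
    For example, something along these lines (just a rough sketch using urllib.parse, not the actual implementation):

      from urllib.parse import urldefrag, urlsplit, urlunsplit

      def normalize(url):
          # Drop the #fragment and lowercase the scheme/host so that
          # https://Example.com/page#sidebar-dashboard and https://example.com/page
          # count as the same node in the crawl graph.
          url, _fragment = urldefrag(url)
          parts = urlsplit(url)
          return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                             parts.path or '/', parts.query, ''))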

    After URL normalization, I'll probably work on saving the website node graph to a file so that you can revisit it as necessary, or stop the crawler and start it again later.

    I'll also work on automatically saving the output/found errors to an --output_file so that they're easier to work with.
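
    In my head the save/resume part is something as simple as this (purely hypothetical sketch, none of it exists in the repo yet):

      import pickle

      def save_graph(graph, path):
          # Dump the crawled node graph so a later run can resume instead of re-crawling.
          with open(path, 'wb') as f:
              pickle.dump(graph, f)

      def load_graph(path):
          with open(path, 'rb') as f:
              return pickle.load(f)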

    @teamacc
    Added the --ignore_prefix option.
    https://github.com/Jonchun/badlinkfinder/commit/75394ebf1859ddc91c65663c5a792295abbf33e2

      --ignore_prefix IGNORE_PREFIXES
          Ignore prefix when parsing URLs so that it does not detect as invalid.
              --ignore_prefix custom
          will ignore any URL that looks like "custom:nonstandardurlhere.com"
          (You can declare this option multiple times)
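
    For anyone curious, an option like this is roughly the following shape with argparse (simplified sketch, not a copy of the actual commit):

      import argparse
      from urllib.parse import urlparse

      parser = argparse.ArgumentParser(prog='badlinkfinder')
      parser.add_argument('--ignore_prefix', dest='ignore_prefixes', action='append',
                          default=[], help='URL prefix/scheme to skip (repeatable)')
      args = parser.parse_args()

      def should_skip(url):
          # "custom:nonstandardurlhere.com" parses with scheme "custom", so
          # --ignore_prefix custom makes the crawler skip it instead of flagging it.
          return urlparse(url).scheme in args.ignore_prefixes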
    
  • deank Member, Troll

    Can this be run inside of my brain?

  • @deank said:
    Can this be run inside of my brain?

    Doesn't hurt to try.

    Thanked by: netomx
  • deank said: Can this be run inside of my brain?

    404 all those cringe moments eh.

    Thanked by: netomx
  • deank Member, Troll

    Yeah, I've been trying to track down @404error. He's very ... elusive.

  • A 'fancier' version of a bad link finder is caching the content, maybe using a bit of contextual matching/shingling and checking for content that's drastically changed.

    It's useful for folk who link out to external resources, where the resource perhaps expired and points to something entirely different or just simply no longer has the relevant content. It's a lot less trivial than 'is the HTTP code 404' but same utility.

  • @ricardo said:
    A 'fancier' version of a bad link finder is caching the content, maybe using a bit of contextual matching/shingling and checking for content that's drastically changed.

    It's useful for folk who link out to external resources, where the resource perhaps expired and points to something entirely different or just simply no longer has the relevant content. It's a lot less trivial than 'is the HTTP code 404' but same utility.

    You mean to catch those pesky custom 404 pages?

  • donli Member

    If anyone has any tips on how to improve it

    Make it Python 2 compatible since that is what is installed by default on most VPSs.

    Thanked by: jetchirag
  • ricardo Member
    edited April 2018

    teamacc said: You mean to catch those pesky custom 404 pages?

    No, those would be soft 404 pages (I assume that's what you meant), but sure, that's a relevant improvement. I was thinking more along the lines of parked pages or repurposed expired/auctioned domains.

  • @deank said:
    Yeah, I've been trying to track down @404error. He's very ... elusive.

    Is this a good time to lurk?
    //slowly backs away...

  • Jonchun Member
    edited April 2018

    @ricardo said:
    A 'fancier' version of a bad link finder is caching the content, maybe using a bit of contextual matching/shingling and checking for content that's drastically changed.

    This could definitely be implemented, although I'm not 100% sure how useful it would be. You do bring up a good point about external links going bad. I'll write it so external links are checked for validity, but not scraped for more links.

    It's useful for folk who link out to external resources, where the resource perhaps expired and points to something entirely different or just simply no longer has the relevant content. It's a lot less trivial than 'is the HTTP code 404' but same utility.

    It would be somewhat trivial to cache all external resources, do a % similarity match, and trigger an alert if the content differs by more than X%.
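
    Something in this direction, maybe (rough sketch using difflib; the 70% threshold is arbitrary, and shingling would be the fancier approach ricardo mentioned):

      import difflib

      def content_changed(cached_text, current_text, threshold=0.7):
          # Compare the cached copy of an external page with what it serves now and
          # flag it when less than `threshold` of the text is still the same.
          ratio = difflib.SequenceMatcher(None, cached_text, current_text).ratio()
          return ratio < threshold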

    @donli said:

    If anyone has any tips on how to improve it

    Make it Python 2 compatible since that is what is installed by default on most VPSs.

    Most VPSes are also capable of installing Python 3 :) I don't believe in supporting legacy systems when more current ones are easily obtainable. (It would be a different story if py3 weren't supported on a lot of systems.)

  • I'm using a WP plugin to find bad and broken links. It has a cron job that checks all 10,000 posts on my site for internal/external bad and broken links every 24 hours. @Jonchun, you could implement such a feature too!

  • ricardo Member
    edited April 2018

    Jonchun said:
    This could definitely be implemented, although I'm not 100% sure how useful it would be

    Link rot itself is pretty bad, but people also don't heed the advice that "Cool URIs don't change", and this even applies to domains. The main thing (for a number of reasons) is that you wouldn't want to link to something that's no longer showing the expected content. A simple example would be the web hosts who advertise here; some of their domains have expired.

    Wikipedia probably has some useful ideas on the subject, as they're constantly seeing what the web throws up.

  • @ricardo said:

    Jonchun said:
    This could definitely be implemented, although I'm not 100% sure how useful it would be

    Link rot itself is pretty bad, but people also don't heed the advice that "Cool URIs don't change", and this even applies to domains. The main thing (for a number of reasons) is that you wouldn't want to link to something that's no longer showing the expected content. A simple example would be the web hosts who advertise here; some of their domains have expired.

    Wikipedia probably has some useful ideas on the subject, as they're constantly seeing what the web throws up.

    Fair enough. I can add caching + comparing to the backlog.

  • @Sofia_K said:
    I'm using a WP plugin to find bad and broken links. It has a cron job that checks all 10,000 posts on my site for internal/external bad and broken links every 24 hours. @Jonchun, you could implement such a feature too!

    The problem with this approach is that those plugins are basically memory and CPU hogs!

  • IonSwitch_Stan Member, Host Rep
    edited April 2018

    Nice work @Jonchun.

    Thanked by: Jonchun
  • Jonchun Member
    edited April 2018

    edit: removed
