In revisiting some of my older posts, I noticed that some of the links were broken. Well, I can’t have that.
So, I wrote a tool that will crawl each article or directory of articles, gather all the links, and then make a HEAD request to each one. I wrote it in Go because of its concurrency model, implemented by goroutines, since the tool would be making tens, hundreds, perhaps thousands of HEAD requests over its little lifetime.
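To make the goroutine angle concrete, here is a minimal sketch of fanning HEAD requests out concurrently. It is not lifted from link-scanner's source: the names `checkAll`, `headCheck`, `result`, and `errBroken`, the concurrency limit, and the rule that any status of 400 or above counts as broken are all assumptions made for illustration. The demo talks to a local `httptest` server so it runs without touching the network.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// errBroken marks a link whose HEAD request came back with a 4xx/5xx status.
// (Hypothetical name; treating >= 400 as broken is an assumption.)
var errBroken = errors.New("broken link")

// result pairs a link with the outcome of checking it.
type result struct {
	link string
	err  error
}

// headCheck returns a checker that issues a HEAD request with the given client.
func headCheck(client *http.Client) func(string) error {
	return func(link string) error {
		resp, err := client.Head(link)
		if err != nil {
			return err
		}
		defer resp.Body.Close()
		if resp.StatusCode >= 400 {
			return fmt.Errorf("%w: status %d", errBroken, resp.StatusCode)
		}
		return nil
	}
}

// checkAll runs check on every link in its own goroutine, allowing at most
// limit checks in flight at once, and returns the results in input order.
func checkAll(links []string, limit int, check func(string) error) []result {
	if limit < 1 {
		limit = 1
	}
	results := make([]result, len(links))
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	for i, link := range links {
		wg.Add(1)
		go func(i int, link string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it
			// Each goroutine writes only to its own index, so no mutex is needed.
			results[i] = result{link: link, err: check(link)}
		}(i, link)
	}
	wg.Wait()
	return results
}

func main() {
	// A local stand-in server: every path is fine except /missing.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/missing" {
			http.NotFound(w, r)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	client := &http.Client{Timeout: 10 * time.Second}
	links := []string{srv.URL + "/ok", srv.URL + "/missing"}
	for _, r := range checkAll(links, 8, headCheck(client)) {
		if r.err != nil {
			fmt.Printf("BROKEN %s: %v\n", r.link, r.err)
			continue
		}
		fmt.Println("OK    ", r.link)
	}
}
```

Passing the checker in as a function keeps the fan-out logic separate from the HTTP call, which makes it easy to exercise without a server. The semaphore is there because firing a thousand requests at once is a good way to get rate-limited.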
Before I go any further, I wanted to say that it’s not the greatest tool I’ve ever written, but it’s not too shabby, either. It accomplishes my main goals, which is good enough for now.
Here are the design goals:
- Should be a CLI tool and easy to run as a Git pre-commit hook and in CI pipelines.
- Should have a modicum of configuration. At the very least, the user should be able to provide:
    - Custom headers
    - Link regular expression
    - Skip pattern for links
The design goals have been met, it’s a working tool, and I’m happy. I’ll continue to revisit it, because I rarely write something from which I permanently move on. Since I’m always trying to get better at what I do, I’m sure I’ll come back to it and think to myself, “well, this certainly is a pile!”.
In the meantime, its working title is link-scanner, and it is up on my GitHub.
Here is the current usage:
$ ./link-scanner -h
Usage of ./link-scanner:
  -dir string
        Optional. Searches every file in the directory for a match. Non-recursive.
  -filename string
        Optional. Takes precedence over directory searches.
  -filetype .html
        Only searches files of this type. Include the period, i.e., .html (default ".md")
  -header string
        Optional. Takes comma-delimited pairs of key:value
  -q    Optional. Turns on quiet mode.
  -regex string
        Optional. The search pattern (regex) used when gathering the links in an article. (default "(?:https?:\\/\\/[^<>].*\\.[^\\W\\s)\"<>]+[\\w\\.,$'%\\-/?=]*?)$")
  -skipPattern string
        Optional. Will skip any gathered links matching this pattern. (default "\\.onion|example\\.com")
  -v    Optional. Turns on verbose mode.
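How the -regex and -skipPattern options might work together can be sketched in a few lines of Go. This is an illustration under assumptions, not the tool's code: `gatherLinks`, `defaultLinkRE`, and `defaultSkipRE` are hypothetical names, the link pattern is a deliberately simple stand-in rather than the default shown above, and the de-duplication and trailing-punctuation trimming are my own additions. Only the skip pattern is taken from the usage output.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// defaultLinkRE is a simplified stand-in, NOT the tool's default -regex.
var defaultLinkRE = regexp.MustCompile(`https?://[^\s<>()"']+`)

// defaultSkipRE uses the default -skipPattern from the usage output.
var defaultSkipRE = regexp.MustCompile(`\.onion|example\.com`)

// gatherLinks collects every match of linkRE in text, drops anything that
// matches skipRE, and de-duplicates while keeping first-seen order.
func gatherLinks(text string, linkRE, skipRE *regexp.Regexp) []string {
	seen := make(map[string]bool)
	var links []string
	for _, m := range linkRE.FindAllString(text, -1) {
		// Sentence punctuation often trails a bare URL in prose.
		m = strings.TrimRight(m, ".,;:")
		if skipRE.MatchString(m) || seen[m] {
			continue
		}
		seen[m] = true
		links = append(links, m)
	}
	return links
}

func main() {
	article := "See [Go](https://go.dev/doc/) and <https://example.com/page>.\n" +
		"Hidden: http://abc.onion/x, docs at https://pkg.go.dev/regexp. Again: https://go.dev/doc/"
	for _, link := range gatherLinks(article, defaultLinkRE, defaultSkipRE) {
		fmt.Println(link)
	}
}
```

Filtering after matching, rather than trying to fold the exclusions into one giant pattern, keeps each regular expression small enough to reason about.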
Since I wrote this primarily to shoehorn into daily use, let’s now briefly take a look at the three ways I’ve worked it into my workflow: as a CLI tool, as a Git pre-commit hook, and as a GitHub Action.
Testing all the links in a particular file:
$ link-scanner -filename gpg.md
Testing all the links in a particular directory with custom headers:
$ link-scanner -dir content/post -header "User-Agent:Mozilla/5.0,Content-Type:application/json"
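The comma-delimited key:value format that -header accepts is simple to turn into request headers. The following is a rough sketch of one way to do it, assuming a hypothetical `parseHeaders` helper that splits each pair on its first colon; the real tool may parse the string differently.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// parseHeaders turns "Key:Value,Key2:Value2" into an http.Header.
// Each pair is split on its FIRST colon, so values may contain colons.
func parseHeaders(s string) (http.Header, error) {
	h := make(http.Header)
	if strings.TrimSpace(s) == "" {
		return h, nil
	}
	for _, pair := range strings.Split(s, ",") {
		parts := strings.SplitN(pair, ":", 2)
		if len(parts) != 2 || strings.TrimSpace(parts[0]) == "" {
			return nil, fmt.Errorf("malformed header pair %q", pair)
		}
		// Add canonicalizes the key (e.g. "user-agent" becomes "User-Agent").
		h.Add(strings.TrimSpace(parts[0]), strings.TrimSpace(parts[1]))
	}
	return h, nil
}

func main() {
	h, err := parseHeaders("User-Agent:Mozilla/5.0,Content-Type:application/json")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Build (but do not send) a HEAD request carrying the parsed headers.
	req, err := http.NewRequest(http.MethodHead, "https://example.com/", nil)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	req.Header = h
	fmt.Println(req.Method, req.Header.Get("User-Agent"), req.Header.Get("Content-Type"))
}
```

One limitation of splitting on commas: a header value that itself contains a comma, as many real-world User-Agent strings do, would be cut in two and rejected as malformed by this sketch.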
Git pre-commit hook
In my breathtaking article on how I use the Git pre-commit hook, I explicate its setup, so I won’t go over it now in full.
Briefly, you could uncomment one or more of the Git hooks specified in the install.sh script in the dotfiles repository, change directory to the location of your local repository, and then run that script.
Or, copy both the pre-commit runner and pre-commit.d/link-scanner.sh into the hooks directory inside the .git directory at the top level of your repository.
In the root of the repository:
$ cd ./.git/hooks
$ wget --no-clobber https://github.com/btoll/dotfiles/blob/master/git-hub/hooks/pre-commit
$ wget --no-clobber --directory-prefix pre-commit.d https://github.com/btoll/dotfiles/blob/master/git-hub/hooks/pre-commit.d/link-scanner.sh
This script will automatically run when committing (git commit). To bypass it, simply add the --no-verify flag:
$ git commit --no-verify -am 'derpy'
The GitHub Action is very straightforward and probably doesn’t need any explanation, as it’s immediately identifiable by its similarity to the obnoxious number of “IaC” cloud CI/CD tools that the cool kids are constantly yammering on about.
I used GitHub Actions because I just did a short stint at GitHub as a consultant and liked the experience. Also, although I’m not doing it here, I like that you can “bring your own” agent. I know of at least one other platform that does this (Buildkite), but since all of my repositories are hosted by GitHub, I’ll use GitHub Actions.
name: Link Scanner
on:
  push:
    branches:
      - master
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Repository
        uses: actions/checkout@v2
      - name: Scan Links
        uses: docker://btoll/link-scanner:latest
        with:
          args: -dir content/post -v