Discussion in #trilema-hanbot this week led to a rough outline for a two or three piece toolset for leveraging automation to assist in Qntra outreach efforts. In order of priority, the toolset needs to consist of at least items one and two:
- A Blog Crawler: This tool needs to take a starting url as input. Using curl, the crawler will grab the page, collect outbound urls1 from the starting point, and begin crawling in search of blogs with live comment boxes. In most cases2 this will mean going from the directly linked page to the top linked post on the blog and checking whether a comment box is there. Whether the crawler should crawl to complete the sweep out to a certain depth or run a certain number of iterations per run is unclear to me at this time, though I suspect the latter is the more manageable approach. At the end of each run, the crawler should produce a list of urls to blog posts with comment boxes and report the number of targets it found out of the total number of urls crawled.
- A Comment Checker: Far simpler, this takes a list of urls of the sort produced by the crawler above and returns which of those urls contain one of several strings indicating a Qntra outreach comment successfully reached publication, along with the number of successes out of the total urls checked per run.
- (Maybe?) A Comment Shitter: Unlike the two tools above, sample code for accomplishing this is common, as are commercial advertisements from folks claiming to operate this sort of script. Depending on the post, a clearly human comment can be written in 3 to 8 minutes. By contrast, finding a live blog that can take a comment using my own eyes runs anywhere from 2 to 20+ minutes, biased towards the high end. On Google's Blogspot platform, when the blogger decides to allow name/url identification for commenters, there is a 100% success rate in comment publication after doing a 10-ish second task to help Google train their cars' eyes. The situation on Automattic's Wordpress fork is more complicated, but Automattic's bias towards preventing actual communication between persons leaves a lot of open questions to be explored.
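For the comment checker, a minimal sketch of what I have in mind, using nothing beyond Python's standard library. The marker strings and function names here are placeholders of my own, not settled choices:

```python
# Sketch of the comment checker: fetch each url and look for strings
# indicating a published outreach comment. Stdlib only.
import urllib.request

# Hypothetical marker strings; the real list would hold text from the
# actual outreach comments.
MARKERS = ["qntra.net", "Qntra"]

def page_has_marker(html, markers=MARKERS):
    """Return True if any marker string appears in the page source."""
    return any(m in html for m in markers)

def check_urls(urls, markers=MARKERS):
    """Fetch each url, report which contain a marker, and tally successes."""
    hits = []
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # dead link or fetch error; skip and move on
        if page_has_marker(html, markers):
            hits.append(url)
    print("%d successes out of %d urls checked" % (len(hits), len(urls)))
    return hits
```

Separating the string test from the fetching keeps the interesting part testable without touching the network.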
Thusly I have a problem. As I work on learning ways to make the computer do more for me and plugging a skills deficit, the comment checker strikes me as the sort of limited scope problem that makes a fine sample problem for automating. The crawler, however, involves a good deal more complexity and is likely to bring in a larger number of tools. As wrestling with what the crawler should do and what needs to be done to implement it is taking up substantial space in my head, I am inclined to lean on Auctionbot to see if anyone's up for being hired to produce an initial version of the crawler, which I can then deploy, read, study, and learn from.
Proposed Crawler Specifications
I am seeking a web crawling script that does the following:
- Takes as input a url and, optionally, a number of iterations i. 1000 is probably a safe initial default value for i3
- Grabs the url with curl, collects all outgoing links, and writes each url as a line in a text file named churn
- Grabs the top url in churn with curl, follows the first link in an html div named "main", and checks whether that page has a Wordpress or Blogspot comment box allowing comments with a name/url identity. If yes, that page's url is written as a line to a text file named targets. If there is no comment box, the page's url is added as a line to a file named pulp. Outgoing urls on the page are appended to the bottom of churn
- Adds the url taken from churn and the second url followed from it as lines in a file named churned and removes them from churn
- Checks the new top item in churn against churned. If the url isn't in churned, it is processed as in item 3; otherwise, the url is added to churned and once again removed from churn.4
- After performing items 3, 4, and 5 for i iterations, writes a line at the bottom of targets stating "X potential targets identified this run." and a line at the bottom of pulp stating "Y monkeys scribbling on electric paper". X is the number of new lines added to targets during the run, while Y is the number of new lines added to pulp during the run.
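The loop described in items 3 through 5 can be sketched as follows. This is an in-memory Python sketch of my own: the spec's churn, churned, targets, and pulp files are reduced to lists, and the fetching/parsing pieces are left as stub parameters rather than implementations:

```python
# In-memory sketch of the crawl loop from the spec. fetch, extract_links,
# first_main_link and has_comment_box are caller-supplied stubs standing
# in for the curl and html-parsing work.
def crawl(start_url, fetch, extract_links, first_main_link,
          has_comment_box, i=1000):
    churn = list(extract_links(fetch(start_url)))  # seed page's outgoing links
    churned, targets, pulp = [], [], []
    seen = set()
    while i > 0 and churn:
        url = churn.pop(0)          # take the top line of churn
        if url in seen:
            churned.append(url)     # duplicate: log it in churned and skip
            continue
        seen.add(url)
        page = fetch(url)
        post = first_main_link(page)     # first link in the "main" div
        post_page = fetch(post)
        if has_comment_box(post_page):
            targets.append(post)
        else:
            pulp.append(post)
        churn.extend(extract_links(post_page))  # append outgoing urls
        churned.extend([url, post])     # both urls move to churned
        i -= 1
    print("%d potential targets identified this run." % len(targets))
    print("%d monkeys scribbling on electric paper" % len(pulp))
    return targets, pulp, churned
```

One judgment call in the sketch: skipping a duplicate does not consume an iteration, so i counts pages actually processed rather than lines popped off churn.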
Notes On The Spec
The decision to run for a set number of iterations rather than walking a set depth from the first url was made after trying to chart out the additional complexity involved in sweeping this varied thing we call the internet out to specified degrees of depth. Going out to the first degree might be fine. The second may even be manageable. I've not been seeing many blogs that keep their blogrolls trim and limited to active blogs. I suspect that by the time a complete sweep of the third or fourth degree of separation finished, run times would be geological without re-inventing some sort of early Googlebot.5
If this project receives multiple bids, I may hire multiple bidders to each produce their own implementation.
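One piece of the spec worth making concrete for bidders is the name/url comment box check. A heuristic guess at it, based on markup stock Wordpress themes and Blogspot's comment iframe tend to emit; the patterns are assumptions from eyeballing a handful of blogs, not guarantees:

```python
# Heuristic check for a comment form accepting name/url identities.
# Pattern assumptions: stock Wordpress wraps its form in id="respond" or
# id="commentform" and exposes "author" (name) and "url" fields when
# name/url commenting is allowed; Blogspot embeds its form in an iframe
# whose id contains "comment-editor".
import re

WORDPRESS_FORM = re.compile(r'id=["\'](respond|commentform)["\']', re.I)
WORDPRESS_NAMEURL = re.compile(r'name=["\'](author|url)["\']', re.I)
BLOGSPOT_FORM = re.compile(r'comment-editor', re.I)

def has_name_url_comment_box(html):
    """Guess whether a page carries a comment box taking name/url commenters."""
    if BLOGSPOT_FORM.search(html):
        return True  # Blogspot form present; identity policy still needs eyes
    return bool(WORDPRESS_FORM.search(html) and WORDPRESS_NAMEURL.search(html))
```

Given the variety in blog markup noted below, whatever an implementer ships here will need tuning against pages found in the field.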
- Initially I thought sidebar links specifically would be the thing to crawl, imitating the more productive method I found for manually crawling. After viewing the page source on a number of blogs, there is substantial variety in how sidebars are marked as such in the code. Identifying a blog roll is easy with human eyes, but much harder for computers. [↩]
- Though the logic of doing this in all cases is likely to be far simpler. [↩]
- Of course practice will inform. [↩]
- The reason to put duplicate lines into churned is that in looking at churned after a run, some popular things might be identified. [↩]
- This isn't to say a search engine using the original PageRank algorithm and similar, but seeded from a core of Republican blogs wouldn't be useful. It simply isn't useful for the task of churning through the muck of strangers looking for potential people. [↩]