Bingology - The Blog of Aaron 'BingoBoingo' Rogier


Outreach Automation: A Call For Bids

Discussion in #trilema-hanbot this week led to a rough outline for a two or three piece toolset for leveraging automation to assist in Qntra outreach efforts. In order of priority, the toolset needs to consist of at least items one and two:

  1. A Blog Crawler: This tool needs to take as input a starting url. Using curl the crawler will grab the page, grab outbound urls[1] from the starting point, and begin crawling in search of blogs with live comment boxes. In most cases[2] this will mean going from the directly linked page to the top linked post on the blog and seeing if a comment box is there. Whether the crawler should crawl to complete the sweep out to a certain depth or crawl to accomplish a certain number of iterations per run is unclear to me at this time, though I suspect the latter is the more manageable approach. At the end of each run, the crawler should produce a list of urls to blog posts with comment boxes and report the number of targets it found out of the total number of urls crawled.
  2. A Comment Checker: Far simpler, this takes a list of urls of the sort produced by the above crawler and returns which of those urls contain one of several strings indicating a Qntra outreach comment successfully reached publication, along with the number of successes out of the total urls checked per run.
  3. (Maybe?) A Comment Shitter: Unlike the two tools above, sample code for accomplishing this is common, as are commercial advertisements from folks offering to operate this sort of script. Depending on the post, a clearly human comment can be written in 3 to 8 minutes. By contrast, finding a live blog that can take a comment using my own eyes runs anywhere from 2 to 20+ minutes, biased towards the high end. On Google's Blogspot platform, when the blogger decides to allow name/url identification for commenters, there is a 100% success rate in comment publication after doing a 10-ish second task to help Google train their cars' eyes. The situation on Automattic's Wordpress fork is more complicated, but Automattic's bias towards preventing actual communication between persons leaves a lot of open questions to be explored.

Thusly I have a problem. As I work on learning ways to make the computer do more for me and plugging a skills deficit, the comment checker strikes me as the sort of limited-scope problem that makes a fine sample problem for automating. The crawler, however, involves a good deal more complexity and is likely to bring in a larger number of tools. As wrestling with what the crawler should do and what needs to be done to implement it is taking substantial space in my head, I am inclined to lean on Auctionbot to see if anyone's up for being hired to produce an initial version of the crawler, which I can then deploy, read, study, and learn from.
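
Since the checker is the piece I intend to work through myself as a learning exercise, here is a minimal sketch of what it might look like as a shell script around curl and grep. This is an illustration only, not working project code; the choice to keep the marker strings in their own file, and the argument handling generally, are my own assumptions rather than anything settled:

    #!/bin/sh
    # Comment checker sketch: reads a list of urls, one per line, from the
    # file named as the first argument and a list of marker strings, one per
    # line, from the file named as the second argument. A url counts as a
    # success if the fetched page contains any one of the marker strings.
    urls_file="$1"
    markers_file="$2"
    total=0
    hits=0
    while read -r url; do
        [ -z "$url" ] && continue
        total=$((total + 1))
        if curl -sL "$url" | grep -qF -f "$markers_file"; then
            echo "$url"
            hits=$((hits + 1))
        fi
    done < "$urls_file"
    echo "$hits successes out of $total urls checked"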

Proposed Crawler Specifications

I am seeking a web crawling script that does the following:

  1. Takes as input a url and, optionally, a number of iterations, i. 1000 is probably a safe initial default value for i.[3]
  2. Grabs the url with curl, collects all outgoing links, and writes each url as a line in a text file named churn
  3. Grabs the top url in churn with curl, follows the first link in an html div named "main", and checks to see if that page has a Wordpress or Blogspot comment box allowing comments with a name/url identity. If yes, that page's url is written as a line in a text file named targets. If there is no comment box, the page's url is added as a line to a file named pulp. Outgoing urls on the page are appended to the bottom of churn
  4. Adds the url from churn and the second url as lines into a file named churned and removes them from churn
  5. Checks the new top item in churn against churned. If the url isn't in churned, it is processed as in item 3; otherwise it is added to churned and removed once again from churn.[4]
  6. After performing items 3, 4, and 5 for i iterations, writes a line at the bottom of targets stating "X potential targets identified this run." and a line at the bottom of pulp stating "Y monkeys scribbling on electric paper". X is to be the number of new lines added to targets during the run, while Y is to be the number of new lines added to pulp during the run.
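
To give the spec above a more concrete shape, here is a rough shell sketch of the loop I have in mind. It is an illustration only, in no way a reference implementation or a bid. It assumes curl and a reasonably modern grep; the link extraction and the comment-box test are naive grep placeholders I have not validated against real Wordpress or Blogspot markup, and for simplicity it checks the crawled page directly rather than following the first link in a div named "main":

    #!/bin/sh
    # Crawler sketch: start url in $1, iteration count i in $2 (default 1000).
    # File names (churn, churned, targets, pulp) follow the spec above.
    start="$1"
    iters="${2:-1000}"
    : > churn
    : > churned
    touch targets pulp

    extract_links() {
        # crude href extraction from the page on stdin; absolute urls only
        grep -o 'href="http[^"]*"' | sed 's/^href="//;s/"$//'
    }

    has_comment_box() {
        # placeholder test for a Wordpress or Blogspot comment form that
        # accepts a name/url identity; real patterns need to be worked out
        grep -qiE 'id="respond"|comment-form-url|blogger\.com/comment'
    }

    curl -sL "$start" | extract_links >> churn

    targets_before=$(wc -l < targets)
    pulp_before=$(wc -l < pulp)

    i=0
    while [ "$i" -lt "$iters" ] && [ -s churn ]; do
        url=$(head -n 1 churn)
        tail -n +2 churn > churn.tmp && mv churn.tmp churn
        if ! grep -qxF "$url" churned; then
            page=$(curl -sL "$url")
            if printf '%s' "$page" | has_comment_box; then
                echo "$url" >> targets
            else
                echo "$url" >> pulp
            fi
            printf '%s' "$page" | extract_links >> churn
        fi
        # the url lands in churned either way, so repeats pile up there
        echo "$url" >> churned
        i=$((i + 1))
    done

    echo "$(( $(wc -l < targets) - targets_before )) potential targets identified this run." >> targets
    echo "$(( $(wc -l < pulp) - pulp_before )) monkeys scribbling on electric paper" >> pulp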

Notes On The Spec

The decision to run for a set number of iterations rather than walking a set depth from the first url was made after trying to chart out the additional complexity involved in mapping this varied thing we call the internet out to specified degrees of depth. Going out to the first degree might be fine. The second may even be manageable. I've not been seeing many blogs that keep their blogrolls trim and limited to active blogs. I suspect that by the time a complete sweep of the third or fourth degree of separation is completed, run times would be geological without re-inventing some sort of early Googlebot.[5]

If this project receives multiple bids, I may hire multiple bidders to each produce their own implementation.

  1. Initially I thought sidebar links specifically would be the thing to crawl, imitating the more productive method I found for manually crawling. After viewing the page source on a number of blogs, there is substantial variety in how sidebars are marked as such in the code. Identifying a blog roll is easy with human eyes, but much harder for computers. [↩]
  2. Though the logic of doing this in all cases is likely to be far simpler. [↩]
  3. Of course practice will inform. [↩]
  4. The reason to put duplicate lines into churned is that in looking at churned after a run, some popular things might be identified. [↩]
  5. This isn't to say a search engine using the original PageRank algorithm or something similar, seeded from a core of Republican blogs, wouldn't be useful. It simply isn't useful for the task of churning through the muck of strangers looking for potential people. [↩]

This entry was posted on Sunday, February 16th, 2020 at 1:17 a.m. and is filed under Computing, Exercises, Housekeeping.

6 Responses to “Outreach Automation: A Call For Bids”

  1. Ingram says:
    February 18, 2020 at 3:58 p.m.

    I can try to help you with this.

    • BingoBoingo says:
      February 18, 2020 at 8:09 p.m.

      I've got to wrestle with the tools and work on putting this together myself. Trying to hire help for a task using tools I've not mastered was a mistake.

  2. PeterL says:
    February 19, 2020 at 1:17 p.m.

    I see that you say you need to do this yourself, but it sounded like an interesting problem so I threw something together really quick yesterday. You can use this if you want, or ignore it and make your own thing.

    http://peterl.xyz/2020/02/web-crawler-for-finding-commentable-blogs/

    • BingoBoingo says:
      February 19, 2020 at 9:27 p.m.

      Thank you for the thought. I'm making some progress towards a crawler implemented as a shell script.

  3. Ingram says:
    February 19, 2020 at 11:59 p.m.

    What is the difficult part of the script implementation?

    • BingoBoingo says:
      February 20, 2020 at 7:12 p.m.

      For me, grasping the tools and pushing my brain to break the problem down into algorithmic pieces as I'm discovering what the tools can do for me.





