Bingology - The Blog of Aaron 'BingoBoingo' Rogier


A Homework Assignment From Diana_Coman: Trawling Ancient PMs Seeking What Worked For Early Qntra And Where I'm At On Scripting A Conversion Engine

This assignment was issued in #Ossasepia, where a number of young hands have gathered to work on fixing their heads under Diana Coman's guidance:

diana_coman: BingoBoingo: so listen, do yourself a favour for starters, trawl those pms or whatever and write up a summary with what you 2 tried and what worked and what didn't, in what way, etc; write up somewhere in clear also what your current script does and what/where you're stuck + why; I honestly couldn't quite follow at that level of detail from the chans only.

Insights From Re-Reading Ancient History - First Do Harm

Here's the short of what I found re-reading my conversations with cazalla, 2014 - 2015.

  • We aggressively picked fights.
  • In aggressively picking fights, we constantly pushed each other.
  • We carried on an active back and forth searching out the deepest cuts to our targets.
  • We actively searched for the small sort of bad/dumb actors we could target in an effort to draw their attention and the traffic that follows.
  • While we valued keeping the line moving and pushing out stories, we discussed the ones that contributed to our offensive posture.

Re-reading the ancient PM logs between cazalla and myself with an eye to what drove growth, several things stand out, though one stands above the others. From the very first post, we were picking fights, sticking with them, and getting attention from folks with audiences... or at the very least shills.

GAW Miners - During the second half of 2014 one of the most visible marketing pushes was GAW Miners/Great Auk Wireless/ZenMiner marketing "hashlet" "cloud mining" contracts with smooth, "iDevice"-looking photoshopped images in their banners. I'd noticed them on this blog in August as a sort of "who is going to fall for this", but when cazalla took ownership of the story on Qntra, the aggression was dialed up substantially. This line of stories even led to leaks!

Not every early fight drew a crowd; the Bit-Con guy, for one, didn't. Still, we quickly had hit-and-run kids helping to sweep the scammy space.1 Not everything got traction, but the stuff that did got a lot.

As the GAW story unfolded we kept a particularly close eye on CoinFire, another publication born at nearly the same time as Qntra that also aggressively pursued the GAW story. Unlike Qntra, CoinFire imported weird fiatisms into their process, of the "Our writers can't hold Bitcoin for the conflict of interest" variety and so on. While Qntra suffered packet weather from the start, CoinFire was less fortunate. In covering CoinFire's misfortune, we targeted a particularly loud wailing GAW shill by the nick BitJane for her real-time revisions of history as it was unfolding.

By the time the new year came around, GAW began collapsing, but we'd already pivoted toward targeting the upcoming hardfork derpery while wrapping up the Silk Road show trial coverage.

This round of re-reading was informative in reminding me that Qntra was not born a detached spectator. As outreach goes on, I've got to push myself into aggressing aggrievable actors. Re-creating the competitive spirit that feeds this aggression in a logged channel may assist with this,2 but first I have to embrace the aggression myself.

Status On The Crawler Script

So far on the crawler I've managed one line that works, one line that needs refinement, and I have a third I'm struggling to beat into shape. It is taking a different, more brute shape than the one I originally tried to spec.

curl http://domain.tld | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort -u | cat >> churn

This line pulls all the urls present on a page and writes them to a file named churn.
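
For what it's worth, the same pull can be written a touch more quietly. A minimal variant, assuming a curl that knows the usual -s (silent) and -L (follow redirects) flags, and dropping the trailing cat since sort can append on its own:

curl -sL http://domain.tld | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort -u >> churn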

grep -oP '[^\./]*\.[^\./]*(:|/)' churn | sed -e 's/\(:.*\/\|\/\)//g' | awk '!seen[$0]++' | cat >> churndomains

This line takes all the urls in churn and creates a shorter file churndomains consisting of domain.tld entries, with the unfortunate effect of killing subdomains of the x.wordpress.com and y.blogspot.com variety, the price for the sin3 of using lifted filters. Eating the filtering logic or replacing it with awk to save those subdomains is on the to-do list.
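
A minimal sketch of what that awk replacement might look like, assuming every line in churn starts with http:// or https:// the way the grep above guarantees; splitting on "/" makes the host the third field, subdomains and all:

awk -F/ '!seen[$3]++ { print $3 }' churn >> churndomains

This keeps x.wordpress.com and y.blogspot.com intact, at the cost of dropping the scheme entirely, which the third line would then have to paste back on.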

cat churndomains | awk '{ print "http://" $0 system("curl --remote-name " $0 )}' | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort -u | cat >> churn2

This line fails after the first pipe in the following way:

http://t.me23
curl: Remote file name has no length!
curl: try 'curl --help' or 'curl --manual' for more information
http://wordpress.org23
curl: Remote file name has no length!
curl: try 'curl --help' or 'curl --manual' for more information
http://thezman.com23
curl: Remote file name has no length!
curl: try 'curl --help' or 'curl --manual' for more information
http://thepoliticalcesspool.org23

Fixing the second line that creates churndomains to preserve http://everything.tld/ by making the cut after the third "/" strikes me as the most right solution, instead of the current, unsuccessful effort to print an "http://" in front of the frequently truncated domains.
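
If the cut-after-the-third-"/" route pans out, something like the following ought to do it, assuming churn's lines all carry their scheme; with "/" as the delimiter, fields 1 through 3 of a url like http://sub.example.tld/page are "http:", the empty string between the two slashes, and the host, which cut glues back together as http://sub.example.tld:

cut -d / -f 1-3 churn | awk '!seen[$0]++' >> churndomains

Which, as it happens, is essentially the cut invocation arrived at in the first comment below.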

From here the plan is to repeat the steps in lines 2 and 3 another two or three times to do a wide, brute force sweep out from the starting site, producing a massive list of urls churnx for however many steps out the sweep walked. The steps I see as necessary then shift to narrowing the list.

  1. Take the urls from churnx, filtering them against common file extensions (.jpg, .js, .pdf, .mp4, etc.) to cut out urls obviously advertising they lead to things other than blog posts. This will produce a file Filtered1. A rough sketch of all three steps follows this list.
  2. Curl the urls in Filtered1, checking for pages with useful comment boxes, producing a file Filtered2.
  3. Deduplicate Filtered2's contents for repetitions of the same http://everything.tld/ pattern, producing a file targets.
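
Everything in the sketch below is a stand-in until tried: the extension list is only illustrative, and grepping for "textarea" as the tell for a comment box is a guess to be replaced by whatever marker actually proves reliable.

# 1. Drop urls that advertise non-post content (extension list illustrative)
grep -viE '\.(jpg|jpeg|png|gif|css|js|pdf|mp4)(\?|$)' churnx > Filtered1
# 2. Keep only pages that appear to carry a comment box
while read -r url; do
    curl -sL "$url" | grep -qi 'textarea' && echo "$url"
done < Filtered1 > Filtered2
# 3. Boil Filtered2 down to one entry per http://everything.tld/ pattern
cut -d / -f 1-3 Filtered2 | awk '!seen[$0]++' > targets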

At this point I am uncertain which ordering of the last two steps is going to lead to a script that delivers a better targets file. Practice is likely to inform. Once I get this working, a comment checker script that eats targets and checks for strings indicating a comment went through, returning a file placed, should be a simpler exercise. A comment shitter script for the automated submission of comments is likely to be a trickier thing to get right,4 but simply automating target discovery and comment placement checking removes the largest demands on my attention in outreach5 via blog commenting.
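
The checker, at least, looks like it can stay a short loop. A sketch, with MARKER standing in for some distinctive string planted in the submitted comment:

# Fetch each target and record the urls where the planted string shows up
while read -r url; do
    curl -sL "$url" | grep -q 'MARKER' && echo "$url"
done < targets > placed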

Once I have these scripts hammered into shape, I look forward to finding further tasks to wrestle with in building greater familiarity with the tools. I've got deficits to fill and neither ignoring them, rolling over to die, nor trying to cover them will fill them. I have to put first things first and attack them.

  1. That particular one lost to Jesus. [↩]
  2. Trying to rebuild the competitive environment in PMs again rather than logged channels would of course be insane. [↩]
  3. And it is a sin. I've got to eat enough of the tools to write my own filters. [↩]
  4. This doesn't make it any less necessary to have on hand. [↩]
  5. A pursuit which, as rediscovered above, means keeping an eye out for opportunities to pick fights. [↩]

This entry was posted on Thursday, February 20th, 2020 at 9:20 p.m. and is filed under Computing, Exercises, Housekeeping.

3 Responses to “A Homework Assignment From Diana_Coman: Trawling Ancient PMs Seeking What Worked For Early Qntra And Where I'm At On Scripting A Conversion Engine”

  1. BingoBoingo says:
    February 21, 2020 at 12:00 a.m.

    And I've managed to put together:

    cut -f 1,2,3 -d / churn

    To fix the problem of getting everything after the .tld cut off of urls. Baby steps.

  2. Diana Coman says:
    February 21, 2020 at 1:21 p.m.

    For the trawl of past logs - is the attacking stance the only thing you and cazalla tried there? Because I don't see *at all* that part with "what didn't work". It's all nice and helpful to identify what (something!) that worked and focus on that but a proper review of whatever it is has to at least *list* also what did not work (at least so you don't try it again!).

    Fwiw the "attacking" is actually a good point and I think it goes deeper aka what you need (and looking back at it I think re Pizarro too) is *active* as opposed to a weird sort of passive that you tend to default to (and even mask with ever grander words otherwise, ugh).

    For those scripts - can you say exactly what each part of them does and why it's there? It's not enough if it "works" as in you get what you expect - use the opportunity to learn the darned thing, not to lift filters or whatever else. I don't get it - are you somehow in an ill-informed hurry/flurry on this or why such weird approach anyway?

    To put it plainly: if it's worth doing, it is worth doing well. And the sort of "pressure" or hurry or whatever that results in throwing shit at the wall until something sticks is not helping, it just makes everything take *longer* and deliver *less* (+shitty walls but anyways). So, to sort this out: what is the exact sequence of steps that you have already done manually (you did, right?) so that you know what you are now trying to automate? List them neatly, then pick the first, pick your tool (if it's web then probably curl, if it's local text processing then possibly awk), read the man page and see what is useful from there; if you need to, print the darned manual and take the options one at a time and/or *ask* intelligently (aka something like: want to do this exact thing, and I'm thinking to use that and this option because x, y, z, but unsure because t,q,w, can anyone help me figure out wtf here?)

    • BingoBoingo says:
      February 22, 2020 at 12:14 a.m.

      Thank you.

      For the trawl of past logs - is the attacking stance the only thing you and cazalla tried there? Because I don't see *at all* that part with "what didn't work". It's all nice and helpful to identify what (something!) that worked and focus on that but a proper review of whatever it is has to at least *list* also what did not work (at least so you don't try it again!).

      Fwiw the "attacking" is actually a good point and I think it goes deeper aka what you need (and looking back at it I think re Pizarro too) is *active* as opposed to a weird sort of passive that you tend to default to (and even mask with ever grander words otherwise, ugh).

      It's the big one. We went into unaffiliated platforms: reddit, twitter, etc to pick fights. Reddit occasionally drew traffic. Twitter never seemed to, but on there occasionally we could bait folks with thin skin into addressing us off Twitter.

      The one complete flop was a push to get on "Google News". What also didn't work was the venting and commiserating that constituted the bulk of the lines in a portion that grew over time. Near the end that's just about all the log was, in yet another, albeit earlier, case for the pitfalls of not using logged channels for organizing.

      For those scripts - can you say exactly what each part of them does and why it's there? It's not enough if it "works" as in you get what you expect - use the opportunity to learn the darned thing, not to lift filters or whatever else. I don't get it - are you somehow in an ill-informed hurry/flurry on this or why such weird approach anyway?

      To put it plainly: if it's worth doing, it is worth doing well. And the sort of "pressure" or hurry or whatever that results in throwing shit at the wall until something sticks is not helping, it just makes everything take *longer* and deliver *less* (+shitty walls but anyways). So, to sort this out: what is the exact sequence of steps that you have already done manually (you did, right?) so that you know what you are now trying to automate? List them neatly, then pick the first, pick your tool (if it's web then probably curl, if it's local text processing then possibly awk), read the man page and see what is useful from there; if you need to, print the darned manual and take the options one at a time and/or *ask* intelligently (aka something like: want to do this exact thing, and I'm thinking to use that and this option because x, y, z, but unsure because t,q,w, can anyone help me figure out wtf here?)

      I started in a hurry. I'll slow down and revisit the parts I'm not certain about, replacing them with ones I understand. Today, after re-reading the grep man page, that means I have the following two lines:

      curl http://startingdomain.tld | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | awk '!seen[$0]++' | cat >> churn
      cut -f 1,2,3 -d / churn | awk '!seen[$0]++' | cat >> churndomains

      The steps that I'd been doing manually, and that I've been trying to rewrite into an automation-friendly form, were at an ugly, awkward, and intermediate level of abstraction:

      1) Take a browser. Take a blog. Open all of the side bar links.
      2) On each of those I'd go to the most recent post and look for a comment box.
      3) If one was present, I'd slip a comment in.
      4) If the blog was rich in sidebar links I'd open all the new ones.
      5) Repeat to exhaustion finding few comment boxes.
      6) Later revisit commented blogs looking for comment placement successes.

      It's ugly and it imports peculiarities of both the meat in the chair and the web browser. I'm starting to calm. Finding the cut tool and putting together its clean replacement of the ugly second invocation of grep was a boost. I'll unroll what I'm trying to do with the rest of the crawler again and try to keep the pressure off lest flinging find its way back into the process.

      Thank you again for the productive thread today and the clarity that followed.



