This assignment was issued in #Ossasepia, where a number of young hands have gathered to work on fixing their heads under Diana Coman's guidance:
diana_coman: BingoBoingo: so listen, do yourself a favour for starters, trawl those pms or whatever and write up a summary with what you 2 tried and what worked and what didn't, in what way, etc; write up somewhere in clear also what your current script does and what/where you're stuck + why; I honestly couldn't quite follow at that level of detail from the chans only.
Insights From Re-Reading Ancient History - First Do Harm
Here's the short of what I found re-reading my conversations with cazalla, 2014 - 2015.
- We aggressively picked fights.
- In aggressively picking fights, we constantly pushed each other.
- We carried on an active back and forth, searching out the deepest cuts to our targets.
- We actively searched for the small sort of bad/dumb actors we could target in an effort to attract their attention and the traffic that follows.
- While we valued keeping the line moving and pushing out stories, we discussed the ones that contributed to our offensive posture.
Re-reading the ancient PM logs between cazalla and myself with an eye to what drove growth, several things stand out, though one stands above the others. From the very first post we were picking fights, sticking to them, and getting attention from folks with audiences... or at the very least shills.
GAW Miners - During the second half of 2014, one of the most visible marketing pushes was GAW Miners/Great Auk Wireless/ZenMiner marketing "hashlet" "cloud mining" contracts using smooth, "iDevice"-looking photoshopped images in their banners. I'd noticed them on this blog in August as a sort of "who is going to fall for this", but when cazalla took ownership of the story on Qntra, the aggression was dialed up substantially. This line of stories even led to leaks!
Not every early fight drew a crowd; the one with the Bit-Con guy, for instance, didn't. Still, we quickly had hit and run kids helping to sweep the scammy space.1 Not everything got traction, but the stuff that did got a lot.
As the GAW story unfolded we kept a particularly close eye on CoinFire, another publication born at nearly the same time as Qntra that also aggressively pursued the GAW story. Unlike Qntra, CoinFire imported weird fiatisms into their process, of the "Our writers can't hold Bitcoin for the conflict of interest" variety and so on. While Qntra suffered packet weather from the start, CoinFire was less fortunate. In covering CoinFire's misfortune, a particularly loud wailing GAW shill by the nick BitJane was targeted for her real time revisions of history as it was unfolding.
By the time the new year came around, GAW began collapsing, but we'd already pivoted toward targeting the upcoming hardfork derpery while wrapping up the Silk Road show trial coverage.
This round of re-reading was informative in reminding me that Qntra was not born a detached spectator. As outreach goes on, I've got to push myself into aggressing aggrievable actors. Re-creating the competitive spirit that feeds this aggression in a logged channel may assist with this,2 but first I have to embrace the aggression myself.
Status On The Crawler Script
So far on the crawler I've managed one line that works, one line that needs refinement, and I have a third I'm struggling to beat into shape. It is taking a different, more brute shape than the one I originally tried to spec.
curl http://domain.tld | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort -u | cat >> churn
This line pulls all the urls present on a page and writes them to a file named churn.
grep -oP '[^\./]*\.[^\./]*(:|/)' churn | sed -e 's/\(:.*\/\|\/\)//g' | awk '!seen[$0]++' | cat >> churndomains
This line takes all the urls in churn and creates a shorter file churndomains consisting of domain.tld entries, with the unfortunate effect of killing subdomains of the x.wordpress.com and y.blogspot.com sort, for the sin3 of using lifted filters. Eating the filtering logic or replacing it with awk to save those subdomains is on the to-do list.
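A possible replacement, sketched here but not yet tested against the real churn: let awk split each url on its slashes and keep the whole hostname, which leaves x.wordpress.com and y.blogspot.com intact while producing the same bare-domain format the current line does.

# sketch: field 3 of an url split on "/" is the full hostname, subdomains included
awk -F/ '{ print $3 }' churn | awk '!seen[$0]++' >> churndomains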
cat churndomains | awk '{ print "http://" $0 system("curl --remote-name " $0 )}' | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort -u | cat >> churn2
This third line fails after the first pipe in the following way:
http://t.me23
curl: Remote file name has no length!
curl: try 'curl --help' or 'curl --manual' for more information
http://wordpress.org23
curl: Remote file name has no length!
curl: try 'curl --help' or 'curl --manual' for more information
http://thezman.com23
curl: Remote file name has no length!
curl: try 'curl --help' or 'curl --manual' for more information
http://thepoliticalcesspool.org23
Fixing the second line that creates churndomains so it preserves http://everything.tld/ by making the cut after the third "/" strikes me as the more correct solution, instead of the current, unsuccessful effort to print an "http://" in front of the frequently truncated domains.
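cut makes that fairly painless. A minimal sketch, assuming churn holds the full urls harvested by the first line:

# sketch: cutting at the third "/" keeps scheme://host (subdomains and all) and drops the path
cut -d'/' -f1-3 churn | sort -u >> churndomains
# sketch: with full http://host entries in churndomains, the third line can feed curl directly;
# --remote-name was failing because a bare domain has no path to name the saved file after,
# while -s to stdout keeps the page where the next grep can see it
while read -r d; do curl -s "$d/"; done < churndomains | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort -u >> churn2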
From here the plan is to repeat the steps in the second and third lines another two or three times to do a wide, brute force sweep out from the starting site, producing a massive list of urls, churnx, for however many steps out the sweep walked.
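Wrapped in a loop, the repeated sweep might look something like this rough sketch; the round count and file names are placeholders, it assumes the second and third lines have been fixed as above, and it makes no effort yet to avoid re-visiting domains already seen.

# sketch: each round curls the previous round's domains and collects whatever urls turn up
cp churndomains seeds
for round in 2 3 4; do
    while read -r d; do curl -s "$d/"; done < seeds | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort -u >> churn$round
    cut -d'/' -f1-3 churn$round | sort -u > seeds
done
cat churn churn2 churn3 churn4 | sort -u > churnx

From here the steps that I see as necessary shift to narrowing the list.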
- Take the urls from churnx, filtering them against common file extensions (.jpg, .js, .pdf, .mp4, etc.) to cut out urls obviously advertising that they lead to things other than blog posts. This will produce a file Filtered1 (a sketch of these steps follows the list).
- Curl the urls in Filtered1, checking for urls with useful comment boxes, producing a file Filtered2.
- Deduplicate Filtered2's contents for repetitions of the same http://everything.tld/ pattern, producing a file targets.
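Sketched with the same placeholder file names, and guessing at "wp-comments-post.php" as a cheap tell for a live WordPress comment form; other platforms would need their own strings.

# 1) sketch: drop urls whose extension advertises something other than a post
grep -Evi '\.(jpg|jpeg|png|gif|css|js|pdf|mp4)([?#].*)?$' churnx > Filtered1
# 2) sketch: keep only urls whose pages appear to carry a comment form
while read -r url; do curl -s "$url" | grep -q "wp-comments-post.php" && echo "$url"; done < Filtered1 > Filtered2
# 3) sketch: keep the first url seen for each scheme://host
awk -F/ '!seen[$1 "//" $3]++' Filtered2 > targets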
At this point I am uncertain which ordering of the last two steps is going to lead to a script that delivers a better targets file. Practice is likely to inform. Once I get this working, a comment checker script that eats targets and checks for strings indicating a comment went through, returning a file placed, should be a simpler exercise. A comment shitter script for the automated submission of comments is likely to be a trickier thing to get right,4 but simply automating target discovery and comment placement checking removes the largest demands on my attention in outreach5 via blog commenting.
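The checker could be nearly the same loop as above, assuming every comment submitted carries some fixed marker string to grep for, a link back to one's own blog being the obvious candidate; myblog.tld below is only a placeholder.

# sketch: revisit each target and record the ones where the marker string shows up
while read -r url; do curl -s "$url" | grep -q "myblog.tld" && echo "$url"; done < targets >> placed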
Once I have these scripts hammered into shape, I look forward to finding further tasks to wrestle with in building greater familiarity with the tools. I've got deficits to fill and neither ignoring them, rolling over to die, nor trying to cover them will fill them. I have to put first things first and attack them.
- That particular one lost to Jesus. [↩]
- Trying to rebuild the competitive environment in PMs again rather than logged channels would of course be insane. [↩]
- And it is a sin. I've got to eat enough of the tools to write my own filters. [↩]
- This doesn't make it any less necessary to have on hand. [↩]
- A pursuit which, as rediscovered above, means keeping an eye out for opportunities to pick fights. [↩]
And I've managed to put together:
To fix the problem of getting everything after the .tld cut off of urls. Baby steps.
For the trawl of past logs - is the attacking stance the only thing you and cazalla tried there? Because I don't see *at all* that part with "what didn't work". It's all nice and helpful to identify what (something!) worked and focus on that, but a proper review of whatever it is has to at least *list* also what did not work (at least so you don't try it again!).
Fwiw the "attacking" is actually a good point and I think it goes deeper aka what you need (and looking back at it I think re Pizarro too) is *active* as opposed to a weird sort of passive that you tend to default to (and even mask with ever grander words otherwise, ugh).
For those scripts - can you say exactly what each part of them does and why it's there? It's not enough if it "works" as in you get what you expect - use the opportunity to learn the darned thing, not to lift filters or whatever else. I don't get it - are you somehow in an ill-informed hurry/flurry on this or why such weird approach anyway?
To put it plainly: if it's worth doing, it is worth doing well. And the sort of "pressure" or hurry or whatever that results in throwing shit at the wall until something sticks is not helping, it just makes everything take *longer* and deliver *less* (+shitty walls but anyways). So, to sort this out: what is the exact sequence of steps that you have already done manually (you did, right?) so that you know what you are now trying to automate? List them neatly, then pick the first, pick your tool (if it's web then probably curl, if it's local text processing then possibly awk), read the man page and see what is useful from there; if you need to, print the darned manual and take the options one at a time and/or *ask* intelligently (aka something like: want to do this exact thing, and I'm thinking to use that and this option because x, y, z, but unsure because t,q,w, can anyone help me figure out wtf here?)
Thank you.
It's the big one. We went onto unaffiliated platforms: reddit, twitter, etc. to pick fights. Reddit occasionally drew traffic. Twitter never seemed to, but there we could occasionally bait folks with thin skin into addressing us off Twitter.
The one complete flop was a push to get on "Google News". What was also in there that didn't work was the venting and commiserating that constituted the bulk of the lines, in a portion that grew over time. Near the end that's just about all the log was: yet another, albeit earlier, case for the pitfalls of not using logged channels for organizing.
I started in a hurry. I'll slow down and revisit the parts I'm not certain about, replacing them with ones I understand. Today, after re-reading the grep man page, that means I have the following two lines:
The steps that I'd been doing manually, and have been trying to rewrite into an automation-friendly form, were at an ugly, awkward, intermediate level of abstraction:
1) Take a browser. Take a blog. Open all of the side bar links.
2) On each of those I'd go to the most recent post and look for a comment box.
3) If one was present, I'd slip a comment in.
4) If the blog was rich in sidebar links I'd open all the new ones.
5) Repeat to exhaustion finding few comment boxes.
6) Later revisit commented blogs looking for comment placement successes.
It's ugly and it imports peculiarities of both the meat in the chair and the web browser. I'm starting to calm. Finding the cut tool and putting together its clean replacement of the ugly second invocation of grep was a boost. I'll unroll what I'm trying to do with the rest of the crawler again and try to keep the pressure off lest flinging find its way back into the process.
Thank you again for the productive thread today and the clarity that followed.