1. Jul 6th, 2008

    Smart spam and new comment policy

    Akismet has protected your site from 492,588 spam comments already.

    At the moment, my blog lists 44 comments identified as spam for me to review, collected over the past couple of weeks. That’s a drop in the bucket compared to the times when it would ask me to review hundreds of spam comments a day. Akismet must be doing something right.

    So are spammers.

    Of these 44, a bunch of comments look like this:

    Thanks assaf.. Thats well done work. We appreciate hard work

    This:

    I totally agree – I used to spend a lot of time going over code I had written that had so many mistakes in it – not that were clearly moticeable, but if I’d taken a good look through then I would’ve notice the mistakes. It proves that taking the time to check things always comes good in the end.

    And this:

    Will this plugin work with the lastest wordpress version?

    And unlike dumb spam, which you can tell just by looking at the shady content, these comments all relate to the post they’re commenting on. In fact, a couple were valuable enough responses that I approved them at first. Seriously.

    I’m not showing them here because they’d make no sense out of context, but next to the post they’re commenting on, they actually add something to the conversation. Too bad they link to a shady site.

    Welcome to smart spam

    JH is getting similar smart spam, except in JH’s case the comments are ripped off from other sites, like DZone. That’s an interesting take on things. Instead of looking at sites like DZone, Reddit or FriendFeed as destinations for spam, they can be turned into sources of valuable spammunition.

    Update: And so is Karl Fogel. Karl has some good ideas on moderating comments when “the spams are the stream, and the problem is to pick out the rare hams.” (also read through the comments)

    I couldn’t find any source for these comments. The first one is bespoke, though something that’s obviously easy to automate by grabbing the author name from the feed. The second one doesn’t show up on Google, neither does the third (notice the typo, unless it’s smarter than I give it credit for).

    Perhaps we’re looking into cheap labor. It would cost more to generate, but you offset the cost with a higher penetration and survival rate. With crap spam, either I or one of my readers will notice it, and it will get removed. With smart spam, once it gets through it will probably stick around forever.

    In fact, I just went through the recent comments and discovered a couple that had good content, but shady links.

    I was staring at crap spam for so long, in emails and blog comments, that I just assumed it’s a numbers game of crap trying to overwhelm filters. Time to give spammers the credit they deserve. The old filters are not effective anymore, not against a reasonable amount of smart spam.

    New comment policy

    If you don’t feel like it, don’t leave a URL with your comment. But I would appreciate it if you left a URL linking back to your blog. I want to check it out, and I’m sure some of my readers do too. And you will get PageRank for your effort (I’m turning no-follow off).

    But if the link leads to advertising, company or product page, empty shell blog, or anything SEO-related, I’ll mark it as spam.

    1. Jul 6th, 2008

      Crosbie Fitch

      Karl Fogel expressed a similar concern recently. See his post Spam Insidy.

      The tricky cases are the borderline ones, but at the end of the day, if a comment adds value to a conversation it doesn’t matter if a robot or sharecropper submitted it.

      See the not unjustified indignation that results when you suggest a genuine human commenter has failed to attain the same quality of submission as a comment spammer: http://zeroinfluence.wordpress.com/2007/07/19/links-for-2007-07-19/

    2. Jul 6th, 2008

      Scott Markwell

      So what do you think the solution is? I don’t have comments on my ramblings on the web, but to some purists this doesn’t make it a blog. Forcing users to sign up for an account on your site is fruitless: it is a hurdle easily jumped by spammers on bigger sites, and captcha is pretty close to useless (for smaller sites that don’t have machine scientists and statistical analysis on staff), yet legitimate users will stay away unless the site represents an actual community they wish to interact with.

      Using a non-standard CMS is another hurdle for spammers, which I hope will at least slow down the vast majority of idiocy if I ever open up comments.

    3. Jul 6th, 2008

      Scott Markwell

      Oddly enough, doing OpenID didn’t allow me to inject my site, you may wish to look at that mechanism.

    4. Jul 6th, 2008

      Assaf

      @scott. The URL is optional, you don’t have to put anything. But if you do, that URL has to be you, not a link to someone who pays you to comment on their behalf. There will obviously be exceptions: I write about companies, and I expect them to comment back. It’s not a hard and fast rule. I do all the filtering, so there’s judgment involved.

      But if you’re replying to a post about HTTP load balancing and the URL is for a site selling flowers online, you’re most likely to be flagged as spam.

      OpenID has the notion of an identity URL, so when you leave a comment with OpenID it links back to your identity URL. Which is exactly what the comment URL field is meant to be.

      You can use a different identity URL, and Verisign as your identity provider, by adding openid.server and openid.delegate links on your identity page (do a view source on this page and you’ll see them). That tells OpenID to use that page for your identity, and the Verisign server to authenticate it.
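
      As a rough illustration of what a consumer does with those two links, here is a minimal sketch that pulls `openid.server` and `openid.delegate` out of an identity page using Python's standard `html.parser`. The page content and both URLs are placeholders invented for the example, and real OpenID libraries do considerably more (fetching, validation, the authentication exchange itself).

      ```python
      from html.parser import HTMLParser


      class OpenIDLinkParser(HTMLParser):
          """Collect the href of <link rel="openid.server"> and <link rel="openid.delegate">."""

          def __init__(self):
              super().__init__()
              self.links = {}

          def handle_starttag(self, tag, attrs):
              if tag != "link":
                  return
              attributes = dict(attrs)
              # rel may hold several space-separated values
              for rel in (attributes.get("rel") or "").lower().split():
                  if rel in ("openid.server", "openid.delegate") and attributes.get("href"):
                      # Keep the first occurrence of each relation
                      self.links.setdefault(rel, attributes["href"])


      def find_openid_links(page_html):
          parser = OpenIDLinkParser()
          parser.feed(page_html)
          return parser.links


      # Hypothetical identity page; both URLs are placeholders, not real endpoints.
      IDENTITY_PAGE = """
      <html><head>
        <title>My blog</title>
        <link rel="openid.server" href="https://provider.example/server">
        <link rel="openid.delegate" href="https://me.provider.example/">
      </head><body>...</body></html>
      """

      links = find_openid_links(IDENTITY_PAGE)
      print(links)
      ```

      The page you type in as your identity stays yours; the two links only say who vouches for it.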

    5. Jul 6th, 2008

      Assaf

      @crosbie. Thanks, adding a link to Karl’s post. I’m guessing this is a new phenomenon?

      The “add value” is a problem. On one post I got two comments discussing whether the code works on PHP4. Does it pass the “add value” test?

      It certainly passes the “written in 10 seconds and linking to a spammy site” test. And it’s more than likely the commenter never tested it with PHP4, just picking relevant keywords to pass the “add value” test. Would it be worth keeping?

      Unfortunately, it is getting harder for people to pass the Turing test. I’m not sure what to do about that (ideas are welcome). If there’s a link in the post, it seems more relevant to look into the “link quality test”.

    6. Jul 6th, 2008

      Crosbie Fitch

      If you’re going to follow the links, then if they’re disjoint that marks it as spam. However, if despite that you can’t tell the difference, then don’t worry about it. You are then in the realm of editorial selection and must consider redacting comments with little value, i.e. reduce the fluff to icons (either a single icon, or if you want to do some work create several icons, e.g. ticks for positives, crosses for negatives, and light bulbs for neutral).

      Only extremely popular bloggers get fluff, so you probably don’t have to worry about it. Either you’ll get obvious spam, sophisticated spam as trite comments with links to poker sites, or you’ll get thoughtful comments.

    7. Jul 6th, 2008

      Kris

      Could this be generated using Markov chain random text from your own posts and previous comments? I just recently read about this in Programming Pearls; I think the method is well-known.

      The technique can generate some creepy human-looking output, but it is just words based on their frequency of occurrence in the neighborhood of other words.
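
      For anyone who hasn't seen it, a minimal sketch of the word-level technique follows. It is not taken from Programming Pearls; the function names, the order of 2 and the sample text are all assumptions made for illustration.

      ```python
      import random
      from collections import defaultdict


      def build_chain(text, order=2):
          """Map each run of `order` words to the words seen right after it."""
          words = text.split()
          chain = defaultdict(list)
          for i in range(len(words) - order):
              state = tuple(words[i:i + order])
              chain[state].append(words[i + order])
          return chain


      def generate(chain, length=20, seed=None):
          """Walk the chain, picking each next word in proportion to how often it followed."""
          if not chain:
              return ""
          rng = random.Random(seed)
          state = rng.choice(sorted(chain))  # sorted so a fixed seed is repeatable
          order = len(state)
          output = list(state)
          while len(output) < length:
              followers = chain.get(state)
              if not followers:
                  break  # reached a run of words that only ends the source text
              output.append(rng.choice(followers))
              state = tuple(output[-order:])
          return " ".join(output)


      # Placeholder training text; a spammer would feed in posts and earlier comments.
      SAMPLE = (
          "spam filters catch dumb spam but smart spam reads the post and "
          "smart spam leaves a relevant comment and dumb spam leaves a shady "
          "link and smart spam leaves a shady link too"
      )

      chain = build_chain(SAMPLE, order=2)
      print(generate(chain, length=15, seed=1))
      ```

      Every three-word window in the output occurs somewhere in the source, which is why the result looks locally fluent while meaning nothing.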

    8. Jul 6th, 2008

      Assaf

      @kris. I seriously doubt these are Markov chains, the word distribution is off. But now that you bring it up, I’ll keep an eye open for the first Markov spam comment.

      @crosbie. I’m using WordPress, not the best tool for redacting comments, too many steps involved, and I’d rather not spend my energy on editing spam. But like Karl pointed out on his blog, that’s also an opportunity. Maybe a startup, or at least a better plugin.

    9. Jul 7th, 2008

      Greg

      So in a nutshell, any other site that has space for users to post replies (digg, reddit etc), and which links directly to this post, can gather legitimate comments, and a spammer can take those comments from there, post them here and attach their dodgy website link? (and vice versa I suppose)

      I’m surprised stuff like that hasn’t happened sooner, really.

    10. Jul 7th, 2008

      Crosbie Fitch

      Sorry Assaf, I was being general, suggesting that when a blogger has so many comments that the fluff and borderline intellispam becomes a problem, then features such as redaction need to be added to the list of options provided to blog moderators. I’m not trying to persuade you that YOU need to knock such WordPress plugins up yourself.

    11. Jul 7th, 2008

      Karl Fogel

      Some kind of automated trackback mechanism caused an excerpt from the top of your post to appear as a comment on my blog (at the post you reference), here:

      http://www.rants.org/2008/06/23/spam_insidy/#comment-25880

      But because the trackbacker doesn’t really understand either post, nor get quoting conventions right, etc, the comment it left on my blog looks very much like one of the semi-spam comments we’re discussing here. Which is kind of hilarious :-).

      I know it wasn’t intentional on your part, of course. I’ve left another comment following up to it and pointing to this comment.

      I really like Crosbie’s idea of reducing semi-spam and fluff comments to expandable icons by default (I guess if they get followups, they could be automatically expanded, on the theory that receiving a followup indicates that the original comment had more value than the editor originally deemed). I hope that CMS and blog software offers this as an option soon.

      -Karl

    12. Jul 7th, 2008

      Assaf

      Once again, I failed the Turing test :-)

      What I meant re: redaction is that I don’t want to deal with it. Not the editing part of it, nor accommodating for it in the stylesheet. And I don’t have threaded comments, for the same reason. If it looks like spam, I’d rather delete it than work out how to present it.

      The more I think about it, the more I get fixated around the drive-by comment problem.

      Let’s take a simple case: I write a post asking people to yay or nay it in the comments. Somebody does a drive-by comment and leaves “good idea, worth doing”. If this is a real opinion, then it’s adding value to the post, no doubt about it.

      But it could just be a boilerplate comment they leave on any blog they come across, in hopes it will match and stick. By chance it does, and until I see the very same comment on another blog I’m managing (that happens a lot), I might think it’s adding value.

      Anyone who read the comment and did a yay/nay count got a wrong data point, could we say it’s actually removing value by being there? Redacting after the fact doesn’t affect people who already read the content. It’s worse than trying to correct a false rumor after it made the viral round through all the social sites.
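
      Spotting the very same comment on another blog can be automated, at least for exact repeats. The sketch below is one assumed approach, not something I run: it normalizes case, punctuation and whitespace, hashes the result, and remembers which blogs each fingerprint showed up on. `BoilerplateIndex` and the blog names are made up for the example.

      ```python
      import hashlib
      import re
      from collections import defaultdict


      def fingerprint(text):
          """Hash the comment after dropping case, punctuation and extra whitespace."""
          normalized = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
          return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


      class BoilerplateIndex:
          """Remember which blogs each comment fingerprint has been seen on."""

          def __init__(self):
              self._seen = defaultdict(set)

          def record(self, blog, text):
              """Store the comment and return the OTHER blogs that already received it."""
              key = fingerprint(text)
              others = set(self._seen[key]) - {blog}
              self._seen[key].add(blog)
              return others


      index = BoilerplateIndex()
      first = index.record("blog-a", "Good idea, worth doing!")
      second = index.record("blog-b", "good idea,   worth doing")
      print(first, second)
      ```

      It only catches verbatim boilerplate. A spammer who rewords each comment, or a real reader who happens to write the same four words, is still a judgment call.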

    13. Jul 8th, 2008

      Crosbie Fitch

      As I said, if it is impossible to determine whether a comment is genuine or spam then it doesn’t matter.

      If, perversely, you explicitly invite fluff/trite comments, as in “What say you good people? Yay or Nay?”, then you do indeed have to count the yays of fluff spammers. It’s a bit of a contrived example of yours. Your unwritten editorial policy is probably to deprecate “I agree” fluff – newbie commenters that indulge in it soon learn that it’s of little worth (until such time as reputation metrics reveal the true import of ‘I agree’).

      For those who moderate comments, I’d imagine a redaction button/option would only be used in the situation where you aren’t sure and would rather not waste time angsting about whether the potential offence and backlash caused by deletion of an off-topic/trite/fluff/possibly spam comment outweighs the irritation of its presence to other readers. Redaction would convert it to a weeny icon, with alt-text hover excerpt, and with javascript onclick popup for the entire comment (links auto converted to text).

      I think you recognise that you are going to have to deal with this problem (even though you don’t want to). I’m just trying to suggest that the quickest way of dealing with it is to avoid it, and simply redact the borderline cases. The other cases take a moment anyway. Redaction is a way of also only taking a moment.

      A) Fair comment (publish)
      B) Spam, abuse, obscene, etc. (unpublish/delete)
      C) Hmmm. Not sure… (redact)

      Then, because the redacted comment retains its integrity (links despite being in text form can be reconstituted), it’s also an easy thing to have an Unredact option, e.g. when you discover that it was an important dignitary who left an apparently off-topic comment under what they thought you’d recognise was a well known alias. Perhaps George Clooney might submit a vaguely relevant comment about styling his new online casino website [link] under the alias Clooney Tunes?
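
      To make the redact/unredact idea concrete, here is a minimal sketch under my own assumptions: the stored comment body is never modified, so unredacting is just a status change, and only the rendering differs. `Comment`, `Status` and `links_to_text` are hypothetical names, the `[...]` placeholder stands in for the icon, and the onclick popup is left out.

      ```python
      import html
      import re
      from dataclasses import dataclass
      from enum import Enum


      class Status(Enum):
          PUBLISHED = "published"
          DELETED = "deleted"
          REDACTED = "redacted"


      # Simple pattern for <a href="...">label</a>; assumes double-quoted href values.
      ANCHOR = re.compile(r'<a\s+[^>]*href="([^"]*)"[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)


      def links_to_text(body):
          """Turn clickable links into plain text of the form: label [url]."""
          return ANCHOR.sub(lambda m: f"{m.group(2)} [{m.group(1)}]", body)


      @dataclass
      class Comment:
          author: str
          body: str  # stored untouched, so redaction stays reversible
          status: Status = Status.PUBLISHED

          def render(self, excerpt_length=40):
              if self.status is Status.DELETED:
                  return ""
              if self.status is Status.PUBLISHED:
                  return self.body
              # Redacted: a small placeholder with a hover excerpt, no live links.
              excerpt = links_to_text(self.body)[:excerpt_length]
              return f'<span class="redacted" title="{html.escape(excerpt, quote=True)}">[...]</span>'


      comment = Comment(
          "Clooney Tunes",
          'Styling my new <a href="https://casino.example/">casino site</a> was fun.',
      )
      comment.status = Status.REDACTED
      redacted_view = comment.render()
      comment.status = Status.PUBLISHED  # unredact: the original markup is still there
      restored_view = comment.render()
      print(redacted_view)
      ```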

    14. Jul 8th, 2008

      Assaf

      I picked that example to isolate the case where you clearly can’t tell a spam by the content of the comment, without getting into the specifics of a post. Don’t get caught up in the fluff/trite part of it. A drive-by comment could read “this seems to not work well on PHP 4 with MySQL 5”, which is adding value if someone actually used that configuration, and wasting everybody’s time if they just made it up.

      Eventually I have to make a judgement call. I sometimes sit on a comment for a day or two until I decide (say the target site is down); I can just leave it as moderated. New information may change that decision, but that hasn’t happened yet, so it’s hypothetical.

      So what’s bothering me now is not the mechanism, but the decision which will eventually be made.

      Say the comment is by Joe and Joe’s URL goes to an online poker site. Obviously spam that deserves to be nuked, right?

      What if Joe, real person, real name, actually reads my blog but always links back to their employer, since Joe is reading my blog on company time? How is that different from someone working for IBM or their own startup commenting with a company link?

      I worked on a few WordPress plugins. Splogs are still blogs, and blogs use plugins, so what happens when I get a splog commenting on a plugin they might actually be using? (That’s not a hypothetical question)

    15. Jul 8th, 2008

      Assaf

      Did I mention not hypothetical?

      Here’s a comment I just got on a blog post discussing login issues (not on Labnotes):

      “I too had faced this problem initially then one of my friends suggested me that before setting the password I must clear the browser. And after that every thing was fine”

      They left name, email address and URL linking to the Bangalore Management Academy.

      Real comment, drive-by comment or automated spam?

    16. Jul 8th, 2008

      Crosbie Fitch

      If in doubt, don’t nuke it, redact it. If it’s fluff and links to a poker site, well, you will develop your own heuristics to select between nuking and redacting.

      For boilerplate comments that through chance are apposite, we need another plug-in, a Google search showing the moderator how likely it is to be boilerplate (used elsewhere). However, boilerplate comments aren’t really a problem. The value you add is how you respond to them. You don’t remove value from your blog by unwittingly publishing boilerplate. If the boilerplate prompts an interesting response from you then that’s great. If however, your response is to criticise the boilerplate for not making sense or to ask its author to provide a little more evidence then the ball’s in their court. The boilerplater is already losing value through their unthinking behaviour – even if no-one sees the misses because they’re deleted, the duplication of the boilerplate comment will eventually surface and indicate the comment is spam and reduce the reputation of anything it’s trying to promote. This still won’t negate the value of any comments in response, nor make readers think less of you for being ‘caught out’ by a boilerplate comment.

      Even if we became overrun with Elizabots that submit querying comments to bloggers (without otherwise tipping the blogger off that it is a bot), the bloggers’ responses are still valuable (because they’ll only be given in response to apparently good comments). A comment that is good by accident is still a good comment.

      Everyone still has to verify comments anyway (before acting upon them). The blogger doesn’t lend their reputation to commenters, they only pass comments that may reflect the views of real people and may be interesting to others. It’s not like you’re publishing a peer reviewed journal in which the academic credentials of each commenter must be ascertained to demonstrate they are true peers.

      One thing that’s currently different between intellispammers and Elizabots is that unlike the latter, the former do not yet have an energy budget sufficient to engage in any resulting dialogue. Even when they do, or when even Elbonians start getting paid pennies for having jejune repartee with bloggers simply to slip in some promotional links or keywords, the blogger still has an energy budget that dictates how many comments are worthy of their consideration or worth replying to.

      A far trickier problem is what a human commenter is to do when a blogger specifically rejects their further comments, or even erases their historical comments.

      I fancy a far more peer-to-peer+reputation metric based means of blogging – and trackbacks still don’t cut it. Then somewhat like Slashdot, all participants can decide if certain other participants are spammers, religiously/politically unpalatable (subjective), or by default worthy of being considered their peers. Perhaps this is a job for Google? Maybe they’re already working on it?

    17. Jul 8th, 2008

      Assaf

      Let’s do something else and see if we can apply a test to it. Because essentially, that’s something I run in my head before deciding whether to junk a comment that’s likely spam.

      Let’s assume I can predict what the drive-by comment would say on a given post. Most people read the post but don’t come back to check on comments. If the comment has value, then I can amplify that value tremendously by slapping it to the end of the post. That way, it becomes visible to everyone, and some people will even take it as more authoritative.

      So look at the sample comment above, about clearing the cache. That one is a good reflection of most of the drive-by comments I get. It’s not a hypothetical what if, but an actual spam catch. And I already have a couple of cases where, because of multiple drive-bys, I can predict the comment from the post.

      So given the benefit of the doubt that it has value, I’m going to amplify that value by adding this piece of text to the next post that I predict will attract a comment like this. Good or bad?

    18. Jul 9th, 2008

      Crosbie Fitch

      Let’s not forget you will have a moderation policy that becomes clear to regular readers, e.g. a) publish all comments, b) weed out spam after publication, c) no comment is published without having been scrutinised, d) other.

      I am only considering that peculiar case in which you are not sure if a comment is spam.

      In the situation you describe you appear to be quite able to recognise the spam nature of these ‘drive-by’ comments. So I’m not sure what analysis you’re expecting from me.

      Anyway, I don’t think you make commenters or their comments more authoritative by publishing/not removing coincidentally valid drive-by comments (you may however indicate their comment is indeed valid, apposite, sufficiently interesting, contributes to the discussion, etc.). Readers recognise that commenters’ knowledge and intelligence greatly varies, from bearded experts down to newbie code monkeys – and now down to Elbonian undercover SEO operatives.

      Does it matter if a comment is a FAQ asked by someone who seeks only vacuous nofollow-free dialogue rather than enlightenment? If you think it’s worth answering, go for it. What can you lose? Do you feel dirty come the day you realise you’ve been talking to an Elizabot/Moron/Elbonian/Replicant? It doesn’t matter. You are the one who is valued, as is what you say – and this isn’t reduced by being seen to talk to imbeciles or spammers – as long as what you say is valuable. You don’t have to go back and delete the conversation once you’ve found out the skin colour of the person you were talking to, their IQ, or their motivation.

      Of course, you may be on a mission to engage in counter-SEO measures because you simply hate the thought your blog might give a tadette more prominence to a spammer in Google’s analysis.

      What is your purpose in moderating comments? To deny spammers link love? Or to ensure discussion isn’t distracted by advertising or nonsense?

    19. Jul 9th, 2008

      Assaf

      I can tell when a comment is spam with high certainty. Which is the same as saying “I’ll be wrong on occasion.” So it’s the borderline case I’m worried about, and maybe I judged wrong, or got overzealous clearing too much spam, or didn’t get enough sleep that night.

      That’s why I’m advising people to make sure their comments don’t trigger my spam finger.

      The reason I proposed this test is because it’s easy for me to reason about it. It divides comments into two broad categories.

      If the comment is valuable by contents alone, then it would be even more valuable if I put it in the body of the post. Not that I plan on doing it, just that most of the time it’s easy to classify by this test alone.

      So far drive-by spam comments fail that test, which brings us to the next category. It now matters who wrote the comment. Again, simple cases, “I love your new theme!” If it comes from a reader, that’s valuable, I want my theme to appeal to readers. What if it comes from a spammer who thinks it actually stinks and is hard to navigate, but decided on flattery instead because they just want a place for their ad?

      You’re throwing a party at your house and inviting your friends over. How would you feel if someone walked in uninvited and started handing out flyers for online-flowers-cheap, or selling knock-off designer bags, or telling everyone about Elbonia Metals & Glass Inc?

      What if they started each sales pitch with an insightful quote from the Book of Wisdom, or just told every person to reboot the Windows box regularly?

    20. Jul 10th, 2008

      Crosbie Fitch

      a) Is it clearly interesting, adding to the conversation? Yes: PUBLISH.

      b) Is it clearly nonsense or promotion (at best only tangentially related)? Yes: DELETE.

      c) In all other cases redact.

      In class (a) there will be some comments that promote a product, some that are boilerplate, some that are remixes of other comments from related sites, some that are FAQs submitted by spambots or Elbonians, etc. It doesn’t matter. If it met your criteria for a good comment, it is a good comment.

      If, surprisingly, you have enough time to analyse a commenter’s motivation and objective by researching their name, e-mail and links, you can do so, but I see little point in this activity, except perhaps when the comment inspires a response from you.

    21. Jul 10th, 2008

      Assaf

      I don’t have time, which is why sometimes I’m going to err and mark comments as spam against my best wishes. Just happened yesterday over email, I didn’t recognize the name and the short message looked like boilerplate spam, so it went to my spam folder.

      I moderate by email, and like Karl said so well, it’s a stream of spam and my time is spent picking out the ham. So if it looks like spam at quick glance, it gets nuked. If it doesn’t, then I stand by my decision to publish it.

      Right now I’m taking the time to reflect on these comments and where I draw the line. I’m still not entirely sure where it should be, but once I do, we’re back to regular business. And regular business is me skimming my overflowing inbox and making flash decisions.

      And making false positives.

      The point is not to allow spammers in, even if their content is half decent. I don’t care. I check my server logs and I know they’re not just pushing content in, they’re checking if it shows up. Allowing it to show up paints me as a target (yay, more spam!) and helps them improve their mechanisms.

    22. Jul 11th, 2008

      Crosbie Fitch

      Yes, but we’re right back to where we started.

      The original issue was the deficiency in the moderation process of GOOD/BAD(spam/nonsense). If you find this binary process is still workable, and believe that it will always be workable for you, well, excellent.

      All I’m saying is that if the borderlines are taking up too much of your precious time (or you start worrying that you are deleting genuine comments and building up a backlash from newbies who haven’t learnt to avoid sounding like ever more intelligent spammers) there is another process: GOOD(pass)/BAD(delete)/UNSURE(redact).

      I’m merely proposing that. It may have fundamental flaws, but it seems like it might be worth a try one day when some moderators have an uncertainty crisis.
