[dokuwiki] Re: Anti spam brainstorming

  • From: Sander Tekelenburg <tekelenb@xxxxxxxxxx>
  • To: dokuwiki@xxxxxxxxxxxxx
  • Date: Fri, 10 Nov 2006 02:10:15 +0100

At 22:35 +0100 UTC, on 2006-11-09, Andreas Gohr wrote:

[...]

> I like to find solutions against automated spam without using CAPTCHAS
> first but we possibly should create a CAPTCHA plugin anyway.

If you do, please make sure that such a plug-in gives a clear warning that
visual CAPTCHAs make the site inaccessible to blind or otherwise
sight-impaired users. (And textual CAPTCHAs can be a hurdle for non-native
speakers.)

[...]

> As a first start I just added my revert plugin to darcs. It's in a very
> rough state and needs to be improved, but I think we need to include
> this functionality in the next release to give people a way to quickly
> revert spam.

An easy-to-use revert mechanism would definitely be useful. It could possibly
be extended by hooking it up to something that attempts to recognise spam and
can send a warning (by mail, RSS, whatever) to the admin in such cases. An
option to have suspicious edits automatically quarantined, awaiting admin
approval before being published, might also be useful.
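
Something like this rough sketch, just to illustrate the flow (Python only
for brevity; looks_suspicious() and the word list are made-up placeholders
for whatever heuristics -- blacklist hits, scoring, etc. -- end up being used):

    quarantine = []   # edits held back, awaiting admin approval
    pages = {}        # page name -> published text

    BLACKLIST = ("casino", "viagra")   # placeholder word blacklist

    def looks_suspicious(text):
        lowered = text.lower()
        return any(word in lowered for word in BLACKLIST)

    def submit_edit(page, text, author_ip):
        """Publish clean edits; hold suspicious ones and warn the admin."""
        if looks_suspicious(text):
            quarantine.append((page, text, author_ip))
            print("warning: edit on %r from %s held for review" % (page, author_ip))
        else:
            pages[page] = text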

Btw, what if you also received legit edits after having been spammed?
Wouldn't a revert kill those too? It might be more practical to allow admins
to mark an edit as "junk" and have it and all its traces (page history)
deleted.

Or perhaps instead of deleting it, move it to some non-public spam repository
which can be used to learn from. You could optionally even allow different
DokuWikis to make use of each other's spam repositories -- at the risk of
some day the spammer running his own DokuWiki to pollute the community's
repository...

> I asked at the WikiMatrix forum [1] for other Wiki authors solutions.
> Peter Thoeny pointed me to a blacklist [2] used by MoinMoin, TWiki and
> MediaWiki. This list is much bigger than the one from chonqued which
> DokuWiki uses currently. But both lists differ - using both results in
> blacklist of about 400kb - quite heavy. And a blacklist is no 100%
> safety.

If you're gonna aim for 100% of anything you can only fail ;) Blacklists have
their downsides but in general seem to be helpful. Even if such an approach
only catches 30% of spam it's well worth it.

There will never be a 100% sure way to catch spam, and I doubt there will ever
be a single mechanism that can catch 95%. But well-chosen combinations of
mechanisms can perhaps result in 95% being caught.

[...]

> Another idea is to implement some surge protect against many edits in a
> short time. The recent spammings used many different IP addresses so we
> can not bind this to a post-per-ip limit. Any ideas?

I agree with Harry Fuecks that the spammer will probably use his botnet to
send one and the same message within a relatively short period of time.
Certainly the form-based spam I receive seems consistent with that. So
perhaps, instead of binding to IP, which indeed is probably useless, DokuWiki
could bind to content: have DokuWiki recognise when the same content is
posted from different addresses within a (configurable) short period of time.
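
A rough sketch of what I mean, with made-up names and numbers (Python only
for illustration; the window and repeat limit would be configurable):

    import hashlib
    import time

    WINDOW_SECONDS = 600    # configurable window; 10 minutes is just a guess
    MAX_REPEATS = 3         # how often the same content may show up before we block

    # content fingerprint -> timestamps of recent submissions (any IP)
    recent_posts = {}

    def fingerprint(text):
        """Reduce an edit to a fingerprint so identical payloads collide."""
        normalised = " ".join(text.lower().split())
        return hashlib.sha1(normalised.encode("utf-8")).hexdigest()

    def is_surge(text, now=None):
        """True if this same content was already posted too often recently,
        no matter which IP addresses it came from."""
        now = time.time() if now is None else now
        key = fingerprint(text)
        hits = [t for t in recent_posts.get(key, []) if now - t < WINDOW_SECONDS]
        hits.append(now)
        recent_posts[key] = hits
        return len(hits) > MAX_REPEATS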

> Some Wikis implement another check which do not allow posting too much
> links in one edit.

That would probably affect too many legit uses. It's better to receive spam
than to turn away users.

[...]

> The ip's used seem to be either trojaned PCs or open proxies. The bad
> behavior plugin already checks some blackhole lists but those blacklists
> are for open mail relays. Maybe a different blackhole list like [5]
> could work better. Problem with those lists are legit users getting a
> blocked dynamic IP address.

Perhaps one or more of those lists do this already, but it seems to me that
scoring might be helpful here. An IP address could get points for each abuse
it generates. When that score hits a certain threshold, the IP gets
blacklisted. After not generating abuse for x period, its score is slowly
reduced again until it drops back below the threshold and gets unblacklisted. As long
as no further abuse from that IP address is noted, its score keeps on getting
lowered steadily. If new abuse is found, the score goes up again.

Seems to me this could lessen the risk of blacklisting legit users, or at the
very least make it easier for them to get unblacklisted again.
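
As a sketch of the idea (Python just to be concrete; the threshold, points
and decay rate are numbers I pulled out of thin air):

    import time

    BLACKLIST_THRESHOLD = 10.0   # score above which an IP is blocked
    ABUSE_POINTS = 3.0           # points added per detected abuse
    DECAY_PER_DAY = 1.0          # points removed per abuse-free day

    scores = {}                  # ip -> (score, time of last abuse)

    def _decayed(ip, now):
        score, last = scores.get(ip, (0.0, now))
        return max(0.0, score - (now - last) / 86400.0 * DECAY_PER_DAY)

    def report_abuse(ip, now=None):
        """Raise an IP's score whenever it generates abuse."""
        now = time.time() if now is None else now
        scores[ip] = (_decayed(ip, now) + ABUSE_POINTS, now)

    def is_blacklisted(ip, now=None):
        """Blocked only while the decayed score is above the threshold, so a
        legit user who inherits a dirty dynamic IP gets unblocked over time."""
        now = time.time() if now is None else now
        return _decayed(ip, now) >= BLACKLIST_THRESHOLD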

> Maybe we can learn from the methods used in fighting email spam.
> Bayesian filters might work but training them might prove complicated.

FWIW I think Bayesian filtering could definitely be helpful. Especially when
combined with other tools. I'm guessing the wikis that get targeted the most
are likely to be the ones that have the most content; content you could use
as the positives for the Bayesian filter.

If you combine this with a mechanism that allows admins to mark edits as
junk/spam, and have the Bayesian filter learn from that, then training would
become a no-brainer, because people will bother to delete such edits anyway --
they'll just need to make sure they do so by marking them as spam. Probably
not too hard if every page has a "mark as spam" button (shown only when you're
logged in as an admin).
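
To make the idea concrete, here's a toy word-frequency classifier (Python,
hand-rolled and heavily simplified -- a real filter would want proper
tokenising and persistence). Existing pages would train the "ham" side,
admin-marked edits the "spam" side:

    import math
    import re
    from collections import Counter

    class TinyBayes:
        def __init__(self):
            self.counts = {"spam": Counter(), "ham": Counter()}
            self.totals = {"spam": 0, "ham": 0}

        @staticmethod
        def _words(text):
            return re.findall(r"[a-z']+", text.lower())

        def train(self, text, label):
            words = self._words(text)
            self.counts[label].update(words)
            self.totals[label] += len(words)

        def spam_probability(self, text):
            # log-likelihood ratio with add-one smoothing
            log_ratio = 0.0
            for w in self._words(text):
                p_spam = (self.counts["spam"][w] + 1.0) / (self.totals["spam"] + 2.0)
                p_ham = (self.counts["ham"][w] + 1.0) / (self.totals["ham"] + 2.0)
                log_ratio += math.log(p_spam / p_ham)
            return 1.0 / (1.0 + math.exp(-log_ratio))

    filt = TinyBayes()
    filt.train("wiki syntax plugins templates and namespaces", "ham")   # existing content
    filt.train("cheap pills casino click here now", "spam")             # admin-marked edit
    print(filt.spam_probability("buy cheap pills here"))                # should score high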

FWIW, my email use is entirely based on (Eudora's built-in) Bayesian
filtering, which catches some 80 to 90% with virtually no false positives.
Bayesian filtering seems to be quite effective. (Although granted, it will
probably work better with some content than with other content. For a wiki
about Viagra, Bayesian filtering probably won't be too useful ;))


Hope this is useful.


-- 
Sander Tekelenburg, <http://www.euronet.nl/~tekelenb/>
-- 
DokuWiki mailing list - more info at
http://wiki.splitbrain.org/wiki:mailinglist
