Running a site like Codehaus presents many challenges; some are solved easily; some are not.
One of the recurring themes is the necessity of dealing with spam. Since I doubt that any of the moronic spammers that hit our site read our tech articles, this article shouldn’t open us up to additional risk.
Spam hits almost every one of our services:
* email – 100k spam per day – we block with a variety of techniques
* web – referrer spam – we ignore / block prolific spiders
* confluence – profile spam; comment spam
In this post we’ll look at Confluence spam and how we deal with it; and how YOU as a Confluence user can help ensure that the site remains spam free.
Like most internet systems, we use defence in depth against spam.
The techniques shown below can be applied to almost any 3rd party (or in-house) application without modification to the core product.
Defence 1: IP addresses
Following the Atlassian guidelines for JIRA (http://svn.atlassian.com/svn/public/contrib/jira/spamfighting/spammers/blockspammers.rc), we reprocess the block file with a basic Ruby script that converts the iptables blocklist into an HTTPD blocklist.
The spammers.conf is then “included” in any of our sites that needs to be protected from spam – which is most of them.
See update-blocklist
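As a rough illustration of the conversion (this is not the update-blocklist script itself), here is a minimal Ruby sketch. It assumes each rule in the block file is an iptables command carrying its source address after a -s flag; the method name to_httpd_blocklist is hypothetical.

```ruby
# Matches the source address of an iptables rule, e.g. "-s 10.0.0.0/8".
# Assumption: every block rule carries its address after a "-s" flag.
SOURCE_ADDR = /\s-s\s+(\d{1,3}(?:\.\d{1,3}){3}(?:\/\d{1,2})?)/

# Turn iptables rule lines into Apache "deny from" lines.
def to_httpd_blocklist(rules)
  addresses = rules.each_line.map { |line|
    next if line.strip.start_with?("#")   # skip comment lines
    line[SOURCE_ADDR, 1]                  # nil when the line has no -s address
  }.compact.uniq

  (["order allow,deny"] +
    addresses.map { |addr| "deny from #{addr}" } +
    ["allow from all"]).join("\n") + "\n"
end
```

Something like File.write("spammers.conf", to_httpd_blocklist(File.read("blockspammers.rc"))) would then regenerate the include file.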
We also keep a second list of spammers that have harassed our sites. Each entry carries a comment saying when and why the block was added, in case we unintentionally block legitimate users when we block Class A/B/C networks.
order allow,deny
# spammed ActiveIO 20070105
deny from 91.212.226.0/24
deny from 89.120.0.0/16
...
allow from all
Defence 2: Prevent spam user names
Since prevention is far better than a cure, we then prevent zombie machines from creating users with typical spam names. This is a never ending battle of course – but since we are not a high priority target (we hope) – we can get away with pretty simple defences.
Since Confluence does not support a blacklist for names, we use an insert trigger on the “USERS” table in the PostgreSQL database. This trigger prevents the most obvious of spam accounts from even registering. It does cause an ungainly error 500 in the web interface – but most spam bots won’t be put off by that.
The above trigger has been pruned to keep it brief – we have more rules in our live triggers.
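The kind of check such a trigger performs can also be illustrated outside the database. The following is a minimal Ruby sketch only, not our PostgreSQL trigger: the pattern list, SpamNameError and check_username! are all hypothetical names, and the patterns are not our live rules.

```ruby
# Hypothetical patterns for illustration; the live rules are not reproduced here.
SPAM_NAME_PATTERNS = [
  /viagra/i,
  /cialis/i,
  /paris-?hilton/i,
].freeze

# Raised to reject a registration, much as a trigger would abort the INSERT.
class SpamNameError < StandardError; end

# Returns the name unchanged when it is acceptable, raises otherwise.
def check_username!(name)
  hit = SPAM_NAME_PATTERNS.find { |pattern| name =~ pattern }
  raise SpamNameError, "rejected user name: #{name}" if hit
  name
end
```

Raising an error mirrors the behaviour described above: the registration fails outright rather than being silently dropped.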
Defence 3: Destroy spam user names
As with all things, a static defence is usually defeated over time. We supplement our user name trigger with a script that looks up usernames matching a variety of broader rules.
The use of broader rules in this phase is intentional – removing accounts is less problematic than blocking users who may trigger a spam rule by accident.
We still need to watch out for things like “cialis” matching “specialist”. If the user has authored content, though, it doesn’t really matter, as they won’t be removed by the removal script described under Defence 4.
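The “cialis” / “specialist” collision is easy to reproduce. Here is a small Ruby sketch, with hypothetical helper names, comparing a plain substring match with a whole-word match. It only illustrates the trade-off; it is not part of our tooling.

```ruby
SPAM_WORDS = %w[porn cialis incest viagra].freeze

# Case-insensitive substring match: hits the term anywhere, even inside a word.
def substring_hit?(text)
  lowered = text.downcase
  SPAM_WORDS.any? { |word| lowered.include?(word) }
end

# Whole-word match: the term must not be embedded in a longer word.
WORD_RE = /\b(?:#{SPAM_WORDS.join('|')})\b/i

def whole_word_hit?(text)
  WORD_RE.match?(text)
end
```

With these, substring_hit?("I am a Java specialist") is true while whole_word_hit? on the same text is false. The broad match is acceptable here only because the later removal step spares users who have authored content.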
The following query takes a while to run, as it scans all the profile content:
SELECT DISTINCT C.username
FROM CONTENT C
JOIN BODYCONTENT BC ON BC.contentid = C.contentid
WHERE C.content_status = 'current'
  AND C.contenttype = 'USERINFO'
  AND C.username NOT IN ('good-user') -- the whitelist
  AND LENGTH(BC.body) > 100
  AND (
    BC.body LIKE '%porn%' OR
    BC.body LIKE '%cialis%' OR
    BC.body LIKE '%incest%' OR
    BC.body LIKE '%viagra%'
  );
The result is a list of users who may or may not be spammers (but on the balance of probabilities they are):
- doctormexico
- parishiltonsuck
- from-paris-hilton-bff-naked
- paris-hilton-video
- wild-things-clip-denise-richards
Defence 4: Dealing with the junk users
We then use an automated script with all the user IDs that have been gathered in our previous layers and do a top level user removal.
This works well because only users that have not authored content will have their accounts removed. This can inconvenience the occasional user with an odd name that hasn’t authored content, but on the balance of probabilities it achieves the end-goal.
The script can be found at https://svn.rubyhaus.org/confluence4r/trunk/examples/remove-users.
The script is coded in Ruby (because that’s how we roll) and uses our open source confluence4r library for interfacing with Confluence.
Simply feed it a set of parameters like:
./examples/remove-users http://confluence.example.com admin password spammerlist.txt
The script will connect to Confluence, authenticate, and then attempt to remove all the listed users.
It would be trivial to write an XML-RPC client that does the same thing in fewer lines of code; but the script shown above is similar to a lot of other internal management scripts.
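For the curious, a bare XML-RPC version might look roughly like the sketch below. It is not the remove-users script: the helper names are hypothetical, and it assumes Confluence's remote API is enabled at /rpc/xmlrpc and exposes confluence1.login and confluence1.removeUser.

```ruby
# Build an XML-RPC client for a Confluence base URL.
# The require is deferred because xmlrpc ships with older Rubies
# but is a separate gem on newer ones.
def confluence_rpc(base_url)
  require "xmlrpc/client"
  XMLRPC::Client.new2("#{base_url.chomp('/')}/rpc/xmlrpc")
end

# Log in, then try to remove each listed user. Returns the names removed.
# `rpc` is any object with call(method_name, *args), e.g. an XMLRPC::Client.
def remove_users(rpc, admin, password, usernames)
  token = rpc.call("confluence1.login", admin, password)
  usernames.select do |name|
    begin
      rpc.call("confluence1.removeUser", token, name)
    rescue StandardError => e
      # A failed removal is logged and skipped, not fatal.
      warn "could not remove #{name}: #{e.message}"
      false
    end
  end
end
```

Usage would be along the lines of remove_users(confluence_rpc("http://confluence.example.com"), "admin", "password", File.readlines("spammerlist.txt", chomp: true)).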
Defence 5: Whitelist group
Within Confluence we have a group, CAPTCHA-FREE, whose members are not subjected to a CAPTCHA when adding comments. All users registered by Xircles are added to this group, and there is an interface inside Xircles to add extra users.
We also have a group “codehaus-whitelist”; users in it are not affected by our various anti-spam activities. This is required for the odd occasion where a user matches a rule but is not a spammer. We can quickly add these users to the whitelist so that our process remains clean and thought-free.
Defence 6: Proactive Users
If you see spam in Confluence / JIRA, please let support know as we can’t be everywhere at once.
Just send the offending URL and any other information as required.
Defence 7: The Future
We frequently have to revisit our approaches – spammers are a persistent lot – and Confluence / JIRA are likely to attract specialised attention from them in the future.
