The spam… oh the spam…

Posted in Infrastructure by Ben Walding on January 11, 2010

Running a site like Codehaus presents many challenges; some are solved easily; some are not.

One of the recurring themes is the necessity of dealing with spam.  Since I doubt that any of the moronic spammers that hit our site read our tech articles, this article shouldn’t open us up to additional risk.

Spam hits almost every one of our services

* email – 100k spam per day – we block with a variety of techniques

* web – referrer spam – we ignore / block prolific spiders

* confluence – profile spam; comment spam

In this post we’ll look at Confluence spam and how we deal with it; and how YOU as a Confluence user can help ensure that the site remains spam free.

Like most internet systems we use defence in depth against spam.

The techniques shown below can be applied to almost any third-party or in-house application without modification to the core product.

Defence 1: IP addresses

As per the Atlassian guidelines for JIRA – http://svn.atlassian.com/svn/public/contrib/jira/spamfighting/spammers/blockspammers.rc – we reprocess the block file using a basic Ruby script to convert the iptables blocklist into an HTTPD blocklist.

The spammers.conf is then “included” in any of our sites that needs to be protected from spam – which is most of them.

See update-blocklist
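The update-blocklist script itself is not reproduced here. As a rough sketch of the conversion described above, the following assumes the block file contains ordinary iptables rules that name the offending address with -s or --source; the sample input, the constant and the method names are all made up for illustration and are not taken from our script.

```ruby
# Matches the address (optionally with a /prefix) given to -s or --source.
# ASSUMPTION: the block file uses plain iptables rules of this shape.
SOURCE_PATTERN = /\s(?:-s|--source)\s+(\d{1,3}(?:\.\d{1,3}){3}(?:\/\d{1,2})?)(?:\s|\z)/

# Turn iptables rules into Apache "deny from" lines, skipping comment lines
# and dropping duplicate addresses.
def deny_lines(rc_text)
  sources = []
  rc_text.each_line do |line|
    next if line.strip.start_with?('#')
    match = line.match(SOURCE_PATTERN)
    sources << match[1] if match
  end
  sources.uniq.map { |src| "deny from #{src}" }
end

# Wrap the deny lines so the result can be included from a site config.
def httpd_blocklist(rc_text)
  (['order allow,deny'] + deny_lines(rc_text) + ['allow from all']).join("\n") + "\n"
end

# Hypothetical input, not a copy of the real block file.
sample = <<~RC
  # example rules
  iptables -A INPUT -s 91.212.226.0/24 -j DROP
  iptables -A INPUT -s 89.120.0.0/16 -j DROP
RC

puts httpd_blocklist(sample)
```

Writing the output of httpd_blocklist to spammers.conf would give a file in the same shape as the block list shown further down.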

We also keep a second list of spammers that have harassed our sites, with comments recording when and why each block was added, in case we unintentionally block legitimate users when we block whole Class A/B/C networks.

order allow,deny
# spammed ActiveIO 20070105
deny from 91.212.226.0/24
deny from 89.120.0.0/16
...
allow from all
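When a legitimate user reports being locked out, the useful question is which rule caught them and why it was added. A minimal sketch of that lookup, assuming the notes sit either on the line above a rule or trail it after a "#"; the Rule struct and both method names are hypothetical, and host-name rules are not handled:

```ruby
require 'ipaddr'

# One block rule: the CIDR text as written, the parsed network, and its note.
Rule = Struct.new(:cidr, :network, :note)

# Collect "deny from <cidr>" rules together with the note recorded for each.
def parse_rules(text)
  rules = []
  pending_note = nil
  text.each_line do |line|
    line = line.strip
    if line.start_with?('#')
      pending_note = line.sub(/\A#\s*/, '')
    elsif (m = line.match(/\Adeny from (\S+)(?:\s*#\s*(.*))?\z/i))
      rules << Rule.new(m[1], IPAddr.new(m[1]), m[2] || pending_note)
      pending_note = nil
    else
      pending_note = nil
    end
  end
  rules
end

# Return the first rule whose network contains the address, or nil.
def blocking_rule(rules, address)
  ip = IPAddr.new(address)
  rules.find { |rule| rule.network.include?(ip) }
end

SAMPLE_CONF = <<~CONF
  order allow,deny
  # spammed ActiveIO 20070105
  deny from 91.212.226.0/24
  deny from 89.120.0.0/16
  allow from all
CONF

rules = parse_rules(SAMPLE_CONF)
hit = blocking_rule(rules, '91.212.226.17')
puts "blocked by #{hit.cidr}: #{hit.note || 'no note recorded'}" if hit
```

A /16 rule covers 65,536 addresses, which is why the note on each rule matters when deciding whether a block can be narrowed or lifted.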

Defence 2: Prevent spam user names

Since prevention is far better than a cure, we then prevent zombie machines from creating users with typical spam names. This is a never-ending battle of course – but since we are not a high priority target (we hope) – we can get away with pretty simple defences.

Since Confluence does not support a blacklist for names, we use an insert trigger on the “USERS” table in the PostgreSQL database.  This trigger prevents the most obvious of spam accounts from even registering.  It does cause an ungainly error 500 in the web interface – but most spam bots won’t be put off by that.

user_insert_trigger

The above trigger has been pruned to keep it brief – we have more rules in our live triggers.

Defence 3: Destroy spam user names

As with all things, a static defence is usually defeated over time. We supplement our user name trigger with a script to look up usernames that match a variety of more vague rules.

The use of broader rules in this phase is intentional – removing accounts is less problematic than blocking users who may trigger a spam rule by accident.

e.g. we need to watch out for things like “cialis” matching “specialist” – if the user has authored content, then it doesn’t really matter though (as they won’t be removed in the script below)
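To make the “cialis” / “specialist” overlap concrete, here is a small sketch comparing plain substring matching with whole-word matching. It is an illustration only and not part of our tooling; the word list is the one from this post, and both method names are hypothetical.

```ruby
SPAM_WORDS = %w[porn cialis incest viagra].freeze

# Substring match, in the spirit of SQL's LIKE '%word%': "cialis" also
# matches inside "specialist".
def substring_hits(text)
  SPAM_WORDS.select { |word| text.include?(word) }
end

# Whole-word match (case-insensitive): the term must be bounded by
# non-word characters, so "specialist" no longer counts.
def whole_word_hits(text)
  SPAM_WORDS.select { |word| text =~ /\b#{Regexp.escape(word)}\b/i }
end

profile = 'I am a Java specialist'
puts substring_hits(profile).inspect
puts whole_word_hits(profile).inspect
```

Whole-word matching is stricter, but spammers routinely glue words together, so the broader substring rule is the one that fits this phase.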

The following query will take a while, as it scans all the profile content:

SELECT DISTINCT C.username
  FROM CONTENT C, BODYCONTENT BC
 WHERE C.content_status = 'current'
   AND C.contenttype = 'USERINFO'
   AND C.username NOT IN ('good-user') -- the whitelist
   AND BC.contentid = C.contentid
   AND LENGTH(BC.body) > 100
   AND (
     BC.body LIKE '%porn%' OR
     BC.body LIKE '%cialis%' OR
     BC.body LIKE '%incest%' OR
     BC.body LIKE '%viagra%'
    );

The result is a list of users who may / may not be spammers (but on the balance of probabilities they are):

  • doctormexico
  • parishiltonsuck
  • from-paris-hilton-bff-naked
  • paris-hilton-video
  • wild-things-clip-denise-richards

Defence 4: Dealing with the junk users

We then use an automated script with all the user IDs that have been gathered in our previous layers and do a top level user removal.

This works well because only users that have not authored content will have their accounts removed.  This can inconvenience the occasional user with an odd name that hasn’t authored content, but on the balance of probabilities it achieves the end-goal.

The script can be found at https://svn.rubyhaus.org/confluence4r/trunk/examples/remove-users.

The script is coded in Ruby (because that’s how we roll), and uses our open-source confluence4r library for interfacing with Confluence.

Simply feed it a set of parameters like:

./examples/remove-users http://confluence.example.com admin password spammerlist.txt

The script will connect to Confluence, authenticate, and then attempt to remove all the listed users.

It would be trivial to write an XML-RPC client that does the same thing in fewer lines of code; but the script shown above is similar to a lot of other internal management scripts.
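For the curious, a bare XML-RPC version might look roughly like the sketch below. It is not the confluence4r script: it assumes the Confluence remote API is reachable at /rpc/xmlrpc and exposes confluence1.login, confluence1.removeUser and confluence1.logout, and the remove_users method name is made up here. The xmlrpc library shipped with Ruby before 2.4 and is a separate gem afterwards.

```ruby
# Remove each listed user through a Confluence-style XML-RPC endpoint.
# `rpc` is anything that responds to call(method_name, *args), such as an
# XMLRPC::Client. Returns a hash of username => result, where a failed
# removal is recorded as a "failed: ..." string instead of aborting the run.
def remove_users(rpc, admin, password, usernames)
  token = rpc.call('confluence1.login', admin, password)
  results = {}
  begin
    usernames.each do |name|
      begin
        results[name] = rpc.call('confluence1.removeUser', token, name)
      rescue StandardError => e
        results[name] = "failed: #{e.message}"
      end
    end
  ensure
    rpc.call('confluence1.logout', token)
  end
  results
end

# Only runs when given the same four arguments as the example above:
#   url admin password spammerlist.txt
if ARGV.length == 4
  require 'xmlrpc/client'
  url, admin, password, list_file = ARGV
  rpc = XMLRPC::Client.new2("#{url}/rpc/xmlrpc")
  names = File.readlines(list_file).map(&:strip).reject(&:empty?)
  remove_users(rpc, admin, password, names).each do |name, result|
    puts "#{name}: #{result}"
  end
end
```

Passing the client in as a parameter keeps the removal loop testable without a live Confluence instance.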

Defence 5: Whitelist group

Within Confluence we have a group CAPTCHA-FREE – its members are not subjected to a CAPTCHA when adding comments. All users registered by Xircles are added to this group; and there is an interface inside Xircles to add extra users.

We also have a group “codehaus-whitelist”; users in the codehaus-whitelist are not affected by various anti-spam activities. This is required for the odd occasion where a user matches a rule but is not a spammer. We can quickly add these users to the whitelist so that our process remains clean and thought-free.

Defence 6: Proactive Users

If you see spam in Confluence / JIRA, please let support know as we can’t be everywhere at once.

Just send the offending URL and any other information as required.

Defence 7: The Future

We frequently have to revisit our approaches – spammers are a persistent lot – and Confluence / JIRA are likely to attract specialised attention from them in the future.


Confluence 3 deployed

Posted in Uncategorized by Ben Walding on September 27, 2009

Upgrade

Just a quick post today (as opposed to no post at all!)

Yesterday we deployed the latest version of Confluence to http://docs.codehaus.org/ (this was preceded by an upgrade to the latest stable PostgreSQL as well).

The result of this is that we are now running the latest and greatest Confluence (3.0.1) – and you now have access to the features contained within.

New Features

If you’re interested in the new features in Confluence 3, then head on over to Atlassian’s documentation.

Activity Streams

Activity streams are only partially enabled at present. We are looking at ways of integrating activity streams across the Codehaus platform (i.e. covering all our platform – and not just the Atlassian parts).

Support

The usual story applies to support – head on over to http://codehaus.org/support/ for more information on support channels.


Continuous Integration

Posted in Infrastructure, Services by Ben Walding on August 1, 2009

You may or may not know, but Codehaus has had a continuous integration server running for quite some time now.

If you’re just looking for the URL, here it is – http://bamboo.ci.codehaus.org/

History

In the very early days (pre 2004), there was a server called hogshead (in fact there still is). Hogshead provided all services for Codehaus, although there were only a handful of projects and none of them had a core requirement for continuous integration.

In the 2005-2006 era, beaver (at Sentex) was provisioned, and a handful of projects did unauthorised CI style builds on the machine.  As this took out core services, the users doing builds were summarily executed.

Let us not speak of the hard drive crash of May 2006.

With the migration to managed hosting at Contegix, we no longer had infrastructure that could safely run end-user jobs. SimulaLabs provided a machine called cheddar to perform CI builds. A variety of CI tools were used on this machine: Continuum, the Andy Pols precursor to Bamboo (whose name eludes me at present), and finally early versions of Bamboo.

Then cheddar was relocated to a new data centre without warning; and ultimately switched off without warning.

This was untenable and Contegix came to our aid and helped us with codehaus04 (finally we have sensible names!).

codehaus04

codehaus04 is a 4 core machine (Intel(R) Xeon(R) CPU @ 1.86GHz) with 4G of physical memory, 130G of disk and access to the Codehaus internal network for moving data around.

Bamboo

Bamboo can be viewed at http://bamboo.ci.codehaus.org/

We can’t keep up with the Bamboo documentation, so it is recommended that you take a look at the Atlassian Bamboo documentation.

Access

If you require access to create a plan; and you are a project despot; simply raise a JIRA requesting access for your project. All despots are typically granted rights to create new plans.

If you are not a despot; contact one of your despots to get the plan created. They can then grant you management rights for the plan.

Configuring Bamboo for Codehaus eccentricities

Please see the following related articles for more information on configuring Bamboo to work with Codehaus infrastructure:

Git Support

If you have an existing build you wish to switch to Git, then clone the build or create a new build. Do not change the source repository – there is something odd inside Bamboo / Git that will cause your build to go haywire and repeatedly build (and email).

Git repository configuration

The default configuration for Bamboo is pretty straightforward; just follow the prompts and ping support if you have any questions.

Hudson

Hudson will likely be viewed at http://hudson.ci.codehaus.org/ (HOWEVER IT IS NOT AVAILABLE YET!!!)

Hudson support will be coming soon. Stay tuned for more information!

