GDPR: four letters that, when combined, strike fear into the heart of any sysadmin. Luckily, there is still some time before it comes into force, which means complying should be second nature by 25th May 2018. My default position on regulations like this is to consider them from a consumer’s point of view, and think about how I would feel about someone holding personal data of mine for longer than necessary.
Let’s see what the guidance says for an individual: “Personal data is any information that can identify an individual person. This includes a name, an ID number, location data (for example, location data collected by a mobile phone) or a postal address, online browsing history, images or anything relating to the physical, physiological, genetic, mental, economic, cultural or social identity of a person.”
Apache web server
System administrators of publicly accessible web servers rely very heavily on logs, but most of the time the defaults capture more than is absolutely necessary. Around 70% of web requests succeed, so let’s strip end-user data from those at least. The Apache web server allows us to modify the log format to remove the client IP address and user-agent altogether, or to log them only for certain return codes. Using mod_log_config, which is loaded by default, we can set something like this to log a remote IP address only when a request does not return a 200 or 302 code:
LogFormat "%!200,302h %!200,302l %!200,302u %t \"%r\" %>s %b" common
# Mark requests from the loop-back interface
SetEnvIf Remote_Addr "127\.0\.0\.1" dontlog
# Mark requests for the robots.txt file
SetEnvIf Request_URI "^/robots\.txt$" dontlog
# Log what remains
CustomLog logs/access_log common env=!dontlog
With this in place, my Apache web server will only log identifiable information when a request fails, which – let’s face it – is the only time I’m going to care. The next piece of the puzzle is: how long do I really need to keep this? There are two approaches: keep logs for a short period – say, thirty days – then delete them; or alternatively, keep logs for a long time but anonymise them further.
Luckily, most LAMP stacks already have logrotate installed with a default configuration, so it’s as simple as editing /etc/logrotate.conf (or a drop-in file under /etc/logrotate.d/) for your Apache logs:
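A minimal stanza along these lines covers the first option – rotate daily and delete anything older than thirty days. The paths assume a Red Hat-style layout, so adjust to match your distribution:

```
/var/log/httpd/*log {
    daily
    rotate 30
    missingok
    notifempty
    compress
    delaycompress
    sharedscripts
    postrotate
        /bin/systemctl reload httpd.service >/dev/null 2>&1 || true
    endscript
}
```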
For our second option, we can make use of a preremove script, which logrotate runs just before a log file is deleted, passing the name of that file as the first argument. We can use it to parse the file and redirect the output to a new log file; I’ve used the ‘cut’ tool to grab everything from the 4th column onwards in this example:
cut -d' ' -f4- "$1" >>/var/log/httpd/access_log_anon
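As a sanity check, here is what that cut invocation leaves behind for a hypothetical request logged in the common format:

```shell
# A hypothetical line in Apache's common log format
line='203.0.113.7 - frank [10/Oct/2017:13:55:36 +0000] "GET /index.html HTTP/1.0" 200 2326'

# Dropping the first three space-separated fields removes the client IP,
# ident and user, leaving only the timestamp, request, status and size
echo "$line" | cut -d' ' -f4-
# → [10/Oct/2017:13:55:36 +0000] "GET /index.html HTTP/1.0" 200 2326
```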
So that’s a rough outline of how to reduce the amount of GDPR-relevant data your web server stores, keeping only what is absolutely necessary.
Our next issue is internal databases. Oftentimes we find that data is kept for far longer than is necessary and not for the right reasons.
A classic example is demographics: any organisation wants to know who its customers are, what their socio-economic background is, where they are based and what they have purchased or enquired about. This lets Sales and Marketing target the average consumer and, in theory, know what they are going to want in the future. Now, most of the time this data is associated with existing or prospective customers, and until recently that wasn’t such a bad thing. When asked to produce data for strategic decision-making or for Sales and Marketing, think about what is being handed over. Generalise data where it isn’t already, and consider storing it that way.
The new age of GDPR means that, as system administrators and database managers, we have to assess what is absolutely necessary, and it’s easiest to start with stagnant data: that of dormant or inactive customers. Find out if you hold data on inactive customers and clean it out – this doesn’t necessarily mean deleting it; archiving it with restricted access is a good first step. Look at your backup procedures too, and make sure that when databases are backed up, they are cleansed as well.
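As a sketch of that first step – the table and column names here are hypothetical, and the syntax is PostgreSQL – customers with no activity for, say, two years can be moved into a restricted archive table:

```sql
-- Move dormant customers (no activity for two years) into an archive
-- table, then remove them from the live table, all in one transaction
BEGIN;
INSERT INTO customers_archive
    SELECT * FROM customers
    WHERE last_activity < now() - interval '2 years';
DELETE FROM customers
    WHERE last_activity < now() - interval '2 years';
COMMIT;

-- Restrict access to the archive as a first step towards minimisation
REVOKE ALL ON customers_archive FROM PUBLIC;
```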
Go through your database schemas and look for common fields that are unnecessary for day-to-day operations. Oftentimes these will be fields in off-the-shelf software that people feel compelled to fill with data, just because a blank space really irks a certain type of person. Once you’ve identified unnecessary data, get rid of it.
If you have teams that need raw access to your data, consider their final aim – you’ll find that most people don’t realise they are over-reaching in their requests. If they are looking for trends across geographical regions, for example, they don’t need contact details or specific purchasing data. Think very carefully before granting access to raw data, limit the amount of information transferred, and avoid exporting data to offline files or any other uncontrolled medium.
Rather than producing extracts from your database as CSV/XLS files, consider generating distinct SQL views and managing access to those instead. SQL views are a powerful tool for controlling access to privileged data: not only do they let you hide some of the complexity of a given dataset, but you can also control access and maintain integrity. Top tip: when creating a view within your database, add a validity period to the WHERE clause; pseudo-SQL follows:
CREATE OR REPLACE VIEW view_name AS
SELECT column1, column2, ...
FROM table_name
WHERE conditions
  AND date(now()) < '2017-09-01';
Controlling access to data is all well and good, but it must be audited too – another nice feature of most SQL servers is their audit capability (in stock PostgreSQL or MySQL this means query logging or an audit plugin, since ordinary triggers do not fire on SELECT). Auditing lets you see when data is accessed but, very crucially, also when it is NOT accessed. This is the beauty of keeping your data in a database: if you grant access to a view and notice that it is only accessed once, then you can be pretty sure that the user has exported the information. My recommendation would be to maintain an audit of which users accessed the various database views and configure alerts appropriately.
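One lightweight sketch of this – all object names here are hypothetical, and the syntax is PostgreSQL – is to wrap a view in a set-returning function that records each access in an audit table, then grant users the function rather than the view:

```sql
-- Hypothetical audit table recording who read which view, and when
CREATE TABLE view_audit (
    accessed_by text,
    view_name   text,
    accessed_at timestamptz DEFAULT now()
);

-- Wrap the (hypothetical) regional_trends_v view: every call logs a row
-- before returning the view's contents
CREATE FUNCTION regional_trends() RETURNS SETOF regional_trends_v AS $$
    INSERT INTO view_audit (accessed_by, view_name)
        VALUES (current_user, 'regional_trends_v');
    SELECT * FROM regional_trends_v;
$$ LANGUAGE sql;
```

Consumers run SELECT * FROM regional_trends() instead of querying the view directly, and an alert on views that are read once and never again flags likely exports.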
The final option is to create a dataset export, pick out the fields that store personally identifiable information, and substitute values from a good example dataset where necessary or possible. A cursory search on the internet throws up generatedata.com, which is based on a great open source project and has an API too.
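For a quick-and-dirty substitution without an external dataset – the file layout and column positions here are invented for illustration – awk can overwrite the PII columns in a CSV export while leaving the analytically useful fields intact:

```shell
# Hypothetical export: id,name,email,region,total_spend
printf 'id,name,email,region,total_spend\n42,Jane Doe,jane@example.com,North,120.50\n' > customers.csv

# Replace the name and email columns with placeholders, keeping the
# fields useful for trend analysis (region, spend) untouched
awk -F, -v OFS=, 'NR > 1 { $2 = "REDACTED"; $3 = "user" $1 "@example.invalid" } 1' \
    customers.csv > customers_anon.csv

cat customers_anon.csv
# id,name,email,region,total_spend
# 42,REDACTED,user42@example.invalid,North,120.50
```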
My takeaway thought: If you don’t need it, don’t store it.