Notes on fixing statistical databases

For Survey 2 on the NetGen project, I’ve looked at the database yesterday. Just wanted to make notes on the changes, I’ve made and the reasoning why, before I forget because Ruslan has told me there are particular requirements for UK data archiving that is needed by the ESRC.

First of all, Ruslan had fixed some parts of the Survey 2, but these are my additional fixings and have saved it as a new file 28-09-09:

  • Any data that had numeric answers were changed from being string to numeric (this makes it easier for SPSS to analyse the data – particularly if we want means that is if we treat some of the Likert scales as continuous)
  • Some of the data is related to tick boxes, where when students have ticked the boxes this is coded as 1 where students have not ticked this has been left blank as a system-missing value. I have recoded these variables into the same variable to be that 1 = yes and system-missing value to be 0, so that 0 = no. I think it is a fair assumption if the students did not tick the box then it is a no.
  • For QE4, the data was coded as follows: 1= “don’t know”, 2 =”not at all useful”, 3= “not very useful”, 4 = “fairly useful” and 5 = “very useful”. If means were used for analysing this question, then it would be all wrong because of the “don’t know” is represented by 1. So, I’ve recoded into the same variable, so that “don’t know” is now represented by 5, and 1 = “not at all useful”, 2 = “not very useful”, 3 = “fairly useful” and 4 = “very useful”.

These are the things I intend to do today for Survey 2 and if I get time for Survey 3 and the linked Survey 2 and Survey 3:

Change system-missing to user-missing values

At the moment there are system-missing values (i.e. blanks) – I want to change these to user-missing values (i.e 999) (and make sure and add these to the variable as missing) – looked through the internet to see if there was a good reason for doing – or whether that was just my preference – and it seems to be more my preference than anything else … but here is my reasoning for doing it

  1. If we want to do a missing-data analysis it might be easier to sort out who were missing using the 999 as SPSS automatically disregards system-missing values
  2. If any new variable was computed in SPSS, blanks may represent that SPSS failed to compute an answer rather than a student failed to input an answer – so need to distinguish between that
  3. Should really be able to distinguish between questions that are not applicable to the student and where students missed out a question – although in this survey as far as I can see there are no ‘not applicable’ questions

Look for double entries

There seemed to be some double entries for students, I’m not sure whether to delete them or just exclude them from the analysis. As most of the double entries seem to have missing values anyhow they might just be excluded – but possibly shouldn’t take any changes and just omit them from the analysis. Hence I need to create a new variable which will be 1 = include and  o = exclude, where I can then select cases.

Create a new ID

Unfortunately at the moment, the NetGen ID is a string rather than numeric and whilst that is ok for identifying students, when ordering students it tends to be: 2NG10, 2NG100, 2NG11 which is not the best way for finding and ID. So, I’m going to create a numeric ID, by removing the 2NG – going to do this in Excel and then copy back over, since in Excel it will be easy to remove the 2NG.


  1. Had a bit of a panic just now – just had a fire alarm and met Simon outside. He told me that Ruslan had recoded all the attitudinal data to be coded as 1 (not at all) to 5 (very), rather than as 1 (very) to 5 (not at all) which is how the survey office sent it.

    I then realised for QE4, he did not actually put the codes in so, was wondering if that was still the original data (and I already put the codes in and recoded for the “I don’t know” – so big panic!!) … but just cross-checked with the original data, and the database Ruslan was working on – and have realised that he had recoded them to be 1 (Don’t know) to 5 (Very useful) – so panic abated!

  2. Just added the numeric ID – which has made this immensely better for sorting out students.

    Found that there may be duplicates for the following IDs (13,14); (161,162); (458,459); (691,692); (115,116); (605, 606). The interesting bit about this is that the students sometimes have slightly different answers (but there is a high probability it is them – since they have the same phone). I’m wondering is it possible that these students probably returned both a paper and electronic copy of the survey? Or did two surveys?

  3. Survey 3 was looking good because saw that Ruslan had made most of the variables numeric and that the codings were inputted as (1 = never to 5 = very often). However, I have the slight feeling that he did not recode the variables, I think they are still (1 = very often and 5 = never) – because with his codings was seeing most of the students saying they never sent an email (which has to be highly impossible). I also double checked with the raw data and this is the same in the SPSS data sheet. This means I got to recode them … but just want to double check with Ruslan.

  4. Ok, I’ve fixed survey 2 and 3. The merged survey 2 and 3 – might be more of a problem since some of the data from survey 2 might have been recoded whilst that from survey 3 – probably wasn’t. So either I would need to remerge or go through and check and recode. I can’t find a unique id for survey 3 in the merged data, only a unique id for survey 2 – so it might be more difficult to do the former.

  5. Ok have had a good look at the merged 2 and 3 survey data – not quite certain how Ruslan merged the two – but it looks more complicated to remerge rather than fix the variables. The data from survey 2 in the merged data is already recoded – unfortunately the data from survey 3 is not recoded – so, got to go fix that!

  6. Great! Have recoded all the variables in Survey 2 +3 merged to align with Survey 2 and Survey 3, and have now appended S2 to all variables related to Survey 2 in Survey 2 + 3. Ruslan had already did this for variables related to Survey 3, and thought it was a good idea to continue in that manner.

  7. For Survey 3, 3 sets of duplicates were found (ID 416, 416; 347, 348; 188, 189). To decide which set to take out, if the student had complete data for both IDs, then the one with the smaller ID was kept (although in some cases there were slightly different answers). So, using this reason ID 417 and 189 were removed.

    If one of the set had more data than the other then the other was kept. For this reason, ID 347 was taken out. In my database, I now have ID 416, 348 and 189. I’ve resaved the database to today’s date just in case I may need to retrieve the IDs I pulled out.

  8. For Survey 2, I removed the duplicated similarly to Survey 3 and renamed it today’s date.

    ID 14 and 459 were taken out because they had the higher ID. ID 161, 692, 115 and 605 had incomplete data sets. However, for ID (115;116), I appended the comments on the model from ID 115 make to ID 116.

    I now have ID 13, 162, 458, 691, 116 and 606 in the database.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s