This morning I became aware of two news articles, published by Mashable and the BBC news site, on children’s top search terms. These articles reported on data provided by Symantec, aggregated from search terms that were collated through their OnlineFamily.Norton service software. This software is designed to allow parents to monitor (view web histories etc.) and control (set time limits etc.) children’s online activities, but apparently also feeds such data back to Symantec.
I want to raise two issues with regard to this data: first, and very briefly, the way in which it is being used to ‘speak’ for child Internet users; and secondly, its theoretically and methodologically questionable use by the commercial and media sectors. These two topics are closely interlinked, and by focusing on the theory and methodology (un)informing this data I hope to challenge the grounds on which children are spoken for through it.
The real issue at stake here is that commercial firms are using data without a reflexive appreciation of the theoretical and methodological issues attached to its collection and interpretation. Yesterday evening I read the most recent response by Mike Savage and Roger Burrows in their ongoing involvement in the debate on ‘The Coming Crisis of Empirical Sociology’. One aspect of their argument is that commercial firms are producing vast amounts of data, beyond the scale and budget of many academic projects, but often without the theoretical or methodological tools that academics ceaselessly develop. Their suggestion is that academics need to begin to engage with this data rather than dismiss it. They give a number of examples, including the Tesco Store Card, which uses associational data to understand its customers. For example, it might be the case that people who purchase fuel at Tesco are twice as likely as other customers to purchase boiled sweets. Savage and Burrows suggest that there is a great deal to be gained on both sides: large-scale data sets for academics, and theoretically informed interpretations and rigorously tested methodological tools for commercial firms.
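To make the ‘twice as likely’ kind of associational claim concrete, here is a minimal sketch in Python of how such a ratio might be computed. Everything in it is an assumption for illustration: the basket data is invented, and the function name `purchase_ratio` is hypothetical. It is not a description of how Tesco or any other firm actually analyses its data.

```python
def purchase_ratio(baskets, condition_item, target_item):
    """How many times more likely target_item is in baskets that contain
    condition_item than in baskets that do not contain it."""
    with_cond = [b for b in baskets if condition_item in b]
    without_cond = [b for b in baskets if condition_item not in b]
    if not with_cond or not without_cond:
        raise ValueError("need baskets both with and without the condition item")

    # Proportion of each group that also bought the target item.
    rate_with = sum(target_item in b for b in with_cond) / len(with_cond)
    rate_without = sum(target_item in b for b in without_cond) / len(without_cond)
    if rate_without == 0:
        raise ValueError("target item never bought without the condition item")
    return rate_with / rate_without


# Invented shopping baskets (hypothetical data, not from any real store).
baskets = [
    {"fuel", "sweets"},
    {"fuel", "sweets"},
    {"fuel"},
    {"fuel", "milk"},
    {"milk", "sweets"},
    {"milk"},
    {"bread"},
    {"bread", "milk"},
]

ratio = purchase_ratio(baskets, "fuel", "sweets")
print(ratio)
```

In this made-up sample, half of the fuel buyers also bought sweets against a quarter of everyone else, which is where a figure like ‘twice as likely’ would come from. Note that the ratio alone says nothing about who is in the sample or why the association holds, which is exactly the kind of question the rest of this post raises.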
On these grounds, I want to offer a deconstruction of the data provided by Symantec to suggest how it could be better informed. There are some gaping holes in the information provided about this data set, and the following bullet points outline my main criticisms:
• Age and Categories of Childhood – This data treats children as a homogeneous group of unspecified age, with no breakdown into age categories. This can be extremely misleading and can produce false representations, as in the case of the BBC’s reporting of ‘Kids’ Top Search Includes Porn’. The Internet habits of children vary vastly, not just by age but by location, gender, socio-economic background, ICT ability and so on. I was particularly struck by the absence of the CBBC website, which my own research suggests is the most prominently used and trusted website for young children in the UK. This raises the question of who is actually being represented by this data. Without a clear breakdown of age categories, or even a statement of which age group the data refers to, very little meaningful information can be drawn from the list of search terms.
• Sample Size – The size of a sample is extremely important in methodological terms. First, it allows us to contextualise the data in relation to its population. We have no clues as to how many children are using the Internet with the Norton software monitoring their search terms. (And as I will show in the remaining bullet points, we don’t actually know who is in the sample.) Secondly, if we had a breakdown of age groups, we might be able to appreciate the weighting of the data, that is, whether one particular age group has greater representation than another. Teenagers, for example, may be represented over and above young children.
• Data Source and Data Collection – We know that the data comes from Symantec and that it is based on families who use their software. However, this raises questions such as: how many families use the software? What sort of socio-economic background are these families from? Which family members is the software monitoring? We also need to be aware of the purpose of the data collection. Symantec are a company that produce software for parents concerned about their children’s Internet use. The prominence of terms such as ‘sex’ and ‘porn’ underlines the very concern that would lead parents to purchase such software.
• Temporally Locating the Data – One of the few pieces of information provided with the data was the period in which it was collected (February 2008 to July 2009). This is quite significant, as the data represents a temporally limited period. The appearance on the search list of Michael Jackson and of ‘Fred’, a YouTube sensation, suggests popularity specific to that time frame. How this period was determined remains unanswered (for example, why not 12 months or two years?).
• Geographically Locating the Data – Finally, we have few clues as to where this data was collected from (American households? UK households? Globally?). The absence of CBBC would suggest it was either American or, potentially, global. Again, this is an extremely large gap. Another missing site was Sulake’s Habbo. In 16th place was Webkinz, which KZero research listed as having 6 million unique users in Q2 2009, compared to Habbo’s 135 million unique users in the same period (the largest for any virtual world). Yet Habbo, owned by the Finnish company Sulake, fails to appear in the top 100. This would again suggest a predominantly American dataset.
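The weighting concern raised under Sample Size can be illustrated with a short sketch, again in Python. The records, the two age bands (`"13-17"` and `"7-12"`), the counts and the function names are all invented for this example; nothing here is drawn from Symantec’s data. The point is only to show how a pooled ‘top searches’ list can hide a smaller group entirely.

```python
from collections import Counter

# Hypothetical (age_group, search_term) records, invented for illustration.
# Teenagers contribute 90 searches; younger children contribute only 10.
records = (
    [("13-17", "youtube")] * 40
    + [("13-17", "facebook")] * 30
    + [("13-17", "sex")] * 20
    + [("7-12", "cbbc")] * 6
    + [("7-12", "youtube")] * 3
    + [("7-12", "club penguin")] * 1
)


def top_terms(records, n):
    """Top n search terms with every age group pooled together."""
    counts = Counter(term for _, term in records)
    return [term for term, _ in counts.most_common(n)]


def top_terms_by_group(records, n):
    """Top n search terms computed separately for each age group."""
    groups = {}
    for group, term in records:
        groups.setdefault(group, Counter())[term] += 1
    return {g: [t for t, _ in c.most_common(n)] for g, c in groups.items()}


def group_shares(records):
    """Fraction of all records contributed by each age group."""
    sizes = Counter(group for group, _ in records)
    total = sum(sizes.values())
    return {g: size / total for g, size in sizes.items()}


pooled = top_terms(records, 3)
by_group = top_terms_by_group(records, 1)
shares = group_shares(records)

print(pooled)
print(by_group)
print(shares)
```

In this invented sample the pooled top three is made up entirely of the teenagers’ terms, while the younger group’s most common search never appears, simply because that group supplies a tenth of the records. Publishing the group shares alongside the list would at least let a reader see the imbalance.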
Finally, it is worth noting that, if this data represents anything of children’s Internet activities, it is what children ‘look’ for, not necessarily what they visit. Thus, to return to the original point of this blog post, the data informing the conclusions made by BBC News, Mashable and, most importantly, Symantec has not been subjected to rigorous debate. In particular, we should be wary of the conclusions we draw about children and their Internet use from data that lacks a considered, or open, methodology and theoretical background. Hopefully such debate will also create a forum in which academic and commercial bodies and persons can combine resources and skills to produce more rigorous understandings of the potential of data and research.