Thursday, June 11, 2015

Aesthetic obsolescence VI: Limitations of Youtube data

A good deal of the discussion relating to my recent studies revolved around the reliability and propriety of using Youtube viewership data. The prefatory remarks to my very first essay on the subject had admitted that the data source is far from ideal for systematic research, and that I was using it in full awareness of its limitations.

This is why I submitted the data to only a basic level of mathematical analysis. The measure I used was a simple “average views per month of availability” with respect to the aggregate of views over a period. Throughout the reporting of the results, there was only minimal reference to the mathematical ratio. I preferred, instead, to call it an “audience involvement measure”. I did not report any of the statistical trend analysis I had done on the data for my own understanding. Even the graphic treatment of my computations was historical and generational rather than musician-specific.
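For readers who want the measure spelled out, it can be sketched in a few lines of Python. The function name, dates, and view counts below are hypothetical illustrations, not figures from the study.

```python
from datetime import date

def views_per_month(total_views, upload_date, as_of=date(2015, 6, 1)):
    """Average views per month of availability: cumulative views
    divided by the number of months the recording has been online."""
    months = (as_of.year - upload_date.year) * 12 + (as_of.month - upload_date.month)
    return total_views / max(months, 1)

# Two hypothetical recordings with identical cumulative views:
print(views_per_month(24000, date(2013, 6, 1)))  # online 24 months -> 1000.0
print(views_per_month(24000, date(2007, 6, 1)))  # online 96 months -> 250.0
```

The point of dividing by months of availability is visible here: the same cumulative count yields very different "audience involvement" values depending on how long the recording has been online.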

I understand that, despite its known limitations, Youtube viewership data is now being used by researchers even in the more advanced research environments. Presumably because it is better than nothing, its use will not stop. And, because of its known limitations, the controversy over its value, too, will not end.


For those interested in this issue, I decided to write this report on the subject.

ONE: Youtube reports “views” for each recording. The first question is: what is a view? For a 30-minute video, what minimum duration of viewing qualifies as a “view”? Further, does a 10-minute viewing carry the same viewership value as a full 30-minute viewing? We have no way of knowing this and, therefore, no way of using the notion of “viewership” more intelligently.

TWO: In the early days of Youtube, there was a duration limit on uploads. This affected Hindustani music very significantly, as most performances exceeded the limit and had to be split into 2, 3, and sometimes 4 parts for upload. After the duration limit was lifted, performances were mostly uploaded complete. Looking at that data today, can we consider the viewership data for a concert split into 3 parts comparable, as a measure of audience involvement, with that of a complete performance? There is no logical way of building this known discrepancy into our measurement of audience involvement.
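The difficulty can be made concrete. For a concert uploaded in parts, several aggregation rules suggest themselves, and each gives a different answer; none is obviously "the" viewership of the concert. The view counts below are hypothetical.

```python
# Hypothetical view counts for one concert uploaded in three parts.
# Viewership typically falls off across parts, so no single
# aggregation rule is self-evidently correct.
part_views = [12000, 7000, 4000]

candidates = {
    "sum of parts":  sum(part_views),                    # 23000: triple-counts devoted listeners
    "first part":    part_views[0],                      # 12000: ignores the drop-off
    "minimum part":  min(part_views),                    # 4000: roughly those who stayed to the end
    "mean of parts": sum(part_views) / len(part_views),  # ~7666.67: an arbitrary compromise
}
for rule, value in candidates.items():
    print(rule, value)
```

Four defensible rules, four different numbers: exactly the discrepancy that cannot be built into a single measure of audience involvement.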

THREE: Youtube viewership data is cumulative from the date of uploading. By using that data, I am implicitly accepting one view from 2006 on par with one view in 2015. During this period, Youtube viewership has grown exponentially, and its audience profile for every kind of content has almost certainly changed radically. Intuitively, we know the assumption is flawed. But we have no way of adjusting for this flaw in our analysis and interpretation of the data.

FOUR: Another problem with cumulative data is that it obliges us, for instance, to equate 20,000 views accumulated over 20 months with 40,000 views accumulated over 40 months. Intuitively, this equation does not look reasonable. One of the two has to be more valuable as a measure of audience involvement. If the propensity of a recording to accumulate viewers is important, 40,000 over 40 months is more valuable. And, if the speed of audience accumulation is considered important, 20,000 over 20 months is more valuable. Youtube data, as available, does not permit us even to ask such a question, let alone answer it.
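It is worth noting that the simple views-per-month measure is arithmetically blind to this distinction, as a two-line check shows:

```python
# The two hypothetical recordings from the text, under the simple
# "average views per month of availability" measure:
views_a, months_a = 20_000, 20
views_b, months_b = 40_000, 40

rate_a = views_a / months_a  # 1000.0 views per month
rate_b = views_b / months_b  # 1000.0 views per month
print(rate_a == rate_b)      # True: the measure equates the two recordings
```

Both recordings reduce to 1,000 views per month, so the question of which is "more valuable" cannot even be posed with the data as published.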

FIVE: The Youtube audience is global, and so is the audience for Hindustani music. But we have no data on the nation-wise mix of viewership for each recording. In our analysis, we are obliged to treat the viewership number as a homogeneous mass -- which it certainly is not. By implication, we are assuming that foreign audiences of Hindustani music -- across all nationalities and cultures -- have the same relationship with the music, and the same profile, as Indians have. There is enough evidence to show that this is not so. Our analysis and interpretation of the numbers can therefore lead to unjustified and even misleading inferences.

SIX: Although Youtube is a video medium, a large amount of Hindustani music on it consists either of audio content supported by a still visual, or of audio supported by a slide-show. The nature of the content is itself not uniform. In fact, even the notion of “viewership” may be irrelevant to a lot of the content. Do 100 people listening to an audio recording with just a photograph of the musician on the screen represent the same level of audience involvement as 100 people watching him, or another musician, in action on film? If not, how do we devise an “equalization factor” between the three formats -- full video, audio with a still, and audio with a slide-show?
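If such factors existed, applying them would be mechanically trivial; the entire difficulty lies in justifying the values. A minimal sketch, with weights that are entirely hypothetical and chosen only for illustration:

```python
# Entirely hypothetical "equalization factors" for the three content
# formats. The essay's point is precisely that no principled basis
# exists for choosing these numbers.
format_weight = {
    "full video":        1.0,
    "audio + slideshow": 0.8,  # arbitrary
    "audio + still":     0.6,  # arbitrary
}

def equalized_views(raw_views, fmt):
    """Scale raw view counts by an assumed per-format weight."""
    return raw_views * format_weight[fmt]

print(equalized_views(100, "audio + still"))  # 60.0, versus 100.0 for full video
```

The arithmetic is the easy part; defending 0.8 over 0.7 for a slide-show is what the available data gives us no way to do.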

SEVEN: This listing merely covers data source features which impinge directly upon the focus of my studies. Researchers with different perspectives could enumerate a different, and perhaps larger set of limitations.

It appears that, with specific reference to Hindustani music, Youtube neither offers a uniform media experience to its audience, nor publicly provides a rigorous measurement of audience engagement. What, then, is Youtube data good for?

I believe it is better than nothing. It cannot be said to represent “statistics”, but can represent “orders of magnitude”. It can be used only as indicative and never as conclusive. Inferences should be drawn most judiciously from its analysis, with every inference reflecting the analyst’s awareness of data limitations.

A close scrutiny of the Youtube user interface suggests that a lot of very sophisticated analytics on viewership and audience engagement are being generated by its managers for internal consumption. And these analytics are being used to strengthen the relationship between Youtube and its users and advertisers. It is safe to assume that the viewership information Youtube currently offers publicly is also aimed at strengthening those relationships. It was not intended to be helpful to researchers and may, in fact, have been purposely kept unusable for such purposes.

As Youtube grows into a major cultural force, it will find it necessary to understand itself better in order to keep growing.  For this, it will need to engage constructively with social scientists and media researchers in all the geographies and cultures where it has a significant presence. This could launch an era of greater transparency in Youtube analytics.

Until then, the Indian musicologist should be content with “orders of magnitude” and indicative inferences. Is this better than consulting 10 veteran Rasika-s and observers of the music scene? I believe so, because Rasika-s can acquire "personal" preferences (biases, prejudices); impersonally generated numbers cannot. 

© Deepak S. Raja 2015

Tejaswinee Kelkar, Musicologist, IIIT Hyderabad, writes: 

Even displaying the data from hundreds of thousands of videos to users in response to search queries is not an uncomplicated task. It is well known that search algorithms mark popular musicians and frequently viewed videos as relevant far more often than others. This skews access toward videos by a few of the most popular artists. Search queries thus make views and results quite unequal, by algorithmically computing relevance from signals like 'views', which can often be misleading.

This means that the search and view data we have is already skewed, and less indicative of actual preferences than of the bottleneck created by the search-result options that are shown. This is worthy of enquiry because we are talking about a system in which the available choices are more salient for views than, ideally, 'all possible' choices would be.

Google Trends does allow us to peek at search-query data on various Google products, including Youtube, with details such as the country of origin of each query. There it is possible to examine query results, which I think could be a good way to substantiate what we are already finding from the data you have collected.

Computational ontologies are largely responsible for the structure of music query results. The IEEE ontology, used for music databasing in most software such as iTunes, is completely unrelated to the structuring features we would like for classical music: it has categories like 'genre', 'artist', and 'album', which are less relevant to us than, say, 'raga', 'gharana', or 'form (khyal/dhrupad)'. We are trying to address this by developing new ontologies relevant to the categories we want to search, and by building a system that will search open data (like Youtube) through ontological annotations.