by Otago Polytechnic

Taking a look at diversity

David Rozado's data analysis software uses computational linguistics to analyze institutional data.

Increasing volumes of data are now freely available on the internet, including the content that social institutions such as universities, government agencies, corporations and NGOs publish online. Machine learning tools can mine the natural-language information in these data sets to uncover underlying trends such as institutional biases, views and values.

Senior Lecturer David Rozado has written web-scraping programs, known as spiders, that retrieve large volumes of online data by automatically following website links according to predefined criteria. His spiders have retrieved all the textual information contained on the internet domains of the top 50 universities in the United States.
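
As a rough illustration of link-following under a predefined criterion, here is a minimal sketch in Python. It is not David's code. The crawl function, the same-domain rule, the page limit and the in-memory example.edu pages are all assumptions made for the example, and a real spider would also need to fetch pages over the network, respect robots.txt and throttle its requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects href targets and text fragments from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def crawl(start_url, fetch, allowed_domain, max_pages=100):
    """Breadth-first crawl that stays inside allowed_domain.

    fetch is any callable mapping a URL to an HTML string, or None
    when the page is unavailable.
    """
    seen = {start_url}
    queue = deque([start_url])
    corpus = {}
    while queue and len(corpus) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkParser()
        parser.feed(html)
        corpus[url] = " ".join(parser.text)
        for href in parser.links:
            # Resolve relative links and drop fragment identifiers.
            target = urljoin(url, href).split("#")[0]
            host = urlparse(target).netloc
            # Assumed criterion: follow links on the same domain or a subdomain.
            same_site = host == allowed_domain or host.endswith("." + allowed_domain)
            if same_site and target not in seen:
                seen.add(target)
                queue.append(target)
    return corpus


# Hypothetical in-memory "website" so the sketch runs without network access.
PAGES = {
    "https://example.edu/": '<a href="/about">About</a> <a href="https://other.org/">Out</a> Home',
    "https://example.edu/about": '<p>About us</p> <a href="/">Home</a>',
}
corpus = crawl("https://example.edu/", PAGES.get, "example.edu")
print(sorted(corpus))
```

Passing the fetch step in as a function keeps the link-following logic separate from the networking, so the same loop could be pointed at a real HTTP client.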

Using this data, David has trained an artificial neural network to reconstruct the linguistic context of words in the corpus of data. As a by-product of the machine learning objective, the network embeds words into a learned vector space so that words that share common contexts in the source data are located in close proximity to one another in vector space. A word's close neighbours in context, and hence in vector space, reveal the sense in which the word is being used.
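
The idea of reading a word's sense from its neighbours in vector space can be sketched with cosine similarity, a common way to measure closeness between word vectors. The article does not say which similarity measure David's model uses, so that choice is an assumption, as are the function names. The three-dimensional vectors below are made up for illustration and are not taken from the study.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest(word, vectors, k=3):
    """Return the k words whose vectors are most similar to word's vector."""
    query = vectors[word]
    scored = [
        (cosine(query, vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


# Made-up toy vectors for illustration only; not results from the study.
TOY = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.2],
    "truck": [0.0, 0.8, 0.3],
}
print(nearest("cat", TOY, k=1))  # ['dog']
```

In a trained model the vectors typically have hundreds of dimensions and come out of the network itself, but the neighbour lookup works the same way.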

David then used this model to examine the sense in which the concept of "diversity" is used by the 50 universities on their websites. Diversity can be defined demographically (in terms of physical or external characteristics such as race, gender, ethnicity, accent or nationality) and intellectually (in terms of mental characteristics such as viewpoints, beliefs, preferences or political opinion). The analysis showed that these American universities tend to use the word "diversity" mostly to denote demographic rather than intellectual types of diversity. A visualization of the word vectors derived from the data set of American university websites is shown in the figure below, where geometric distance roughly reflects how close words are in context, and hence the sense in which "diversity" is used.
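
One way to turn a comparison like this into a number is to average a target word's similarity to each of two groups of reference words and see which average is higher. The sketch below assumes that approach; the article does not describe the exact procedure used. The tokens and two-dimensional vectors are placeholders, deliberately not real words, and they do not reproduce the study's data or results.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mean_similarity(target, group, vectors):
    """Average cosine similarity between target and the group's known words."""
    sims = [cosine(vectors[target], vectors[w]) for w in group if w in vectors]
    if not sims:
        raise ValueError("none of the group words are in the vocabulary")
    return sum(sims) / len(sims)


# Placeholder tokens with made-up vectors; not the study's data.
TOY = {
    "target": [0.7, 0.3],
    "a1": [0.9, 0.1],
    "a2": [0.8, 0.2],
    "b1": [0.1, 0.9],
    "b2": [0.2, 0.8],
}
score_a = mean_similarity("target", ["a1", "a2"], TOY)
score_b = mean_similarity("target", ["b1", "b2"], TOY)
print(score_a > score_b)  # True for these toy vectors
```

With real embeddings, the two groups would be lists of demographic and intellectual terms, and the target would be the vector for "diversity".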

David's Bachelor of Information Technology students have also developed a front-end interface that allows members of the public to run their own queries on the same data set from the websites of the 50 universities. You can try out the beta version here.