Big Data in Cognitive Science: Flickr as a database of semantic features. (preprint draft).


CHAPTER 7 (pp.144-173): Big Data in Cognitive Science: Flickr as a database of semantic features.

Flickr® Distributional Tagspace: Evaluating the semantic spaces emerging from Flickr® tags distributions.

Marianna Bolognesi, PhD. International Center for Intercultural Exchange, Italy UvA University Amsterdam, Netherlands

Chapter proposal for Big Data in Cognitive Science

Bolognesi

Abstract

Flickr users tag their personal pictures with a variety of keywords. Such annotations could provide genuine insights into the salient aspects of the personal experiences captured in the pictures, which range beyond purely visual features or language-based associations. The goal of this chapter is to mine the semantic patterns that emerge from these complex, open-ended, large-scale bodies of uncoordinated annotations provided by humans. This is achieved by means of distributional semantics, i.e. by relying on the idea that concepts that appear in similar contexts have similar meanings (e.g. LSA, Landauer, Dumais 1997).

This chapter presents the Flickr Distributional Tagspace (FDT), a distributional semantic space built on Flickr tag co-occurrences, and evaluates it as follows: 1) through a comparison between the semantic representations that it produces and those obtained from speaker-generated feature norms collected in an experimental setting, as well as with WordNet-based metrics of semantic similarity between words; 2) through a categorization task and a consequent cluster analysis. The results of the two studies suggest that FDT can deliver semantic representations that correlate with those that emerge from aggregations of feature norms, and can cluster fairly homogeneous categories and subcategories of related concepts.


Introduction

The large-scale collections of user-generated semantic labels that can be easily found online have recently prompted the interest of research communities that focus on the automatic extraction of meaning from large-scale unstructured data, and on the creation of bottom-up methods of semantic knowledge representation. Fragments of natural language such as tags are today exploited because they provide contextual cues that can help solve problems in computer vision research; for example, queries in Google's image search, where one can drag an image and obtain in return other images that are visually similar, can be refined by providing linguistic cues. For this reason, there is a growing interest in analyzing (or ‘mining’) these linguistic labels to identify latent recurrent patterns and extract new conceptual information, without referring to a predefined model such as an ontology or a taxonomy. Mining these large sets of unstructured data retrieved from social networks (also called Big Data) seems more and more crucial for uncovering aspects of the human cognitive system, for tracking trends that are latently encoded in the usage of specific tags, and for fueling business intelligence and decision making in the industry sector (sentiment analysis and opinion mining). The groupings of semantic labels attributed to digital documents by their users, and the semantic structures that emerge from such uncoordinated actions, known as folksonomies (folk-taxonomies), are today widely studied in multimedia research to assess the content of different digital resources (see Peters, Weller, 2008 for an overview), relying on the “wisdom of the crowd”: if many people agree that a web page is about cooking, then with high probability it is about cooking, even if its content does not include the exact word “cooking”. Although several shortcomings of folksonomies have been pointed out (e.g. Peters, Stock
2007), this bottom-up approach to collaborative content structuring is seen as the next transition towards the Web 3.0, or Semantic Web. Whereas multimedia researchers who aim to implement new tools for tag recommendation, machine tagging, and information retrieval in the Semantic Web are already tackling the new challenges set by these resources, in cognitive science task-oriented data collected in experimental settings still seem to be the preferred type of empirical data, because we know little about how and to what extent Big Data can be modeled to reflect human behavior in the performance of typically human cognitive operations.

Related Work: Monomodal and Multimodal Distributional Semantics

In the past twenty years several scholars have managed to harvest and model word meaning representations by retrieving semantic information from large amounts of unstructured data, relying on the Distributional Hypothesis (Harris, 1954; Firth, 1957). The Distributional Hypothesis suggests that words that appear in similar contexts tend to have similar meanings. Distributional models allow the retrieval of paradigmatic relations between words that do not themselves co-occur, but that co-occur with the same other terms: book and manual are distributionally similar because the two words are used in similar sentences, not because they are often used together in the same sentence. Such models have classically been built from the observation of word co-occurrences in corpora of texts (Baroni, Lenci, 2010; Burgess, Lund, 1997; Landauer, Dumais, 1997; Sahlgren, 2006; Turney, Pantel, 2010; Rapp, 2004), and for this reason they have often been ‘accused’ of yielding language-based semantic representations, rather than experience-based semantic representations. In order to overcome the limitations of language-based distributional models, there
have been recent attempts to create hybrid models, in which the semantic information retrieved from word co-occurrences is combined with perceptual information retrieved in different ways, such as from human-generated semantic features (Andrews, Vigliocco, & Vinson 2009; Steyvers 2010; Johns & Jones 2012) or from annotated images, under the assumption that images provide a valid proxy for perceptual information (see for example Bruni, Tran, & Baroni, 2014). Image-based information has been shown to be non-redundant and complementary to text-based information, and the multimodal models in which the two streams of information are combined perform better than those based solely on linguistic information (Andrews, Vigliocco, & Vinson, 2009; Baroni, Lenci, 2008; Riordan, Jones, 2011). In particular, it has been shown that while language-based distributional models capture encyclopedic, functional and discourse-related properties of words, hybrid models can also harvest perceptual information retrieved from images. Such hybrid models constitute a great leap forward in the endeavor of modeling human-like semantic knowledge by relying on the distributional hypothesis and on large amounts of unstructured, human-generated data. Yet, I believe, they present some questionable aspects, which I summarize here. Combining text-derived with image-derived information by means of sophisticated techniques appears to be an operation that is easily subject to error (how much information should be used from each stream, and why? Does the merging technique make sense from a cognitive perspective?). Moreover, this operation seems to lean too much toward a strictly binary distinction between visual and linguistic features (respectively retrieved from two separate streams), leaving aside other possible sources of information (e.g. emotional responses, cognitive operations, other sensory reactions that are not captured by purely visual or purely linguistic corpora). Moreover, the way in which visual information is retrieved from images might present
some drawbacks. For example, image-based information included in hybrid models is often collected through real-time “games with a purpose”, created ad hoc to stimulate descriptions of given stimuli from individuals, or coordinated responses between two or more users (for a comprehensive overview, see Thaler et al., 2011). In the popular ESP game (Ahn, Dabbish, 2004, licensed by Google in 2006), for example, two remote participants who do not know each other have to associate words with a shared image, trying to coordinate their choices and produce the same associations as fast as possible, thus forcing each participant to guess how the other participant would “tag” the image. Although the entertaining nature of these games is crucial to keep the participants motivated during the task, and entails little or no expense, the specific instructions provided to the contestants can constrain the range of associations that a user might attribute to a given stimulus, and trigger ad-hoc responses that provide only partial insights into the content of semantic representations. As Weber, Robertson, and Vojnovic (2008) show, ESP gamers tend to match their annotations on colors, or to produce generic labels in order to quickly match the other gamer, rather than focusing on the actual details and peculiarities of the image. The authors also show that a ‘robot’ can predict fairly appropriate tags without even seeing the image. In addition, ESP as well as other databases of annotated images harvest annotations provided by people who are not familiar with the images: the images are provided by the system. Arguably, such annotations reflect semantic knowledge about the concepts represented, which are processed as categories (concept types), rather than individual experiential instances (concept tokens). Thus, such images cannot be fully acknowledged to be a good proxy for sensorimotor information, because there has not been any sensorimotor experience: the annotator has not experienced the exact situation captured by the image. Finally, in hybrid models the texts and the images used as sources of information have been produced/processed by different populations, and thus they may not be comparable.


Motivated by these concerns, my research question is the following: can we build a hybrid distributional space that 1) is based on a unique but intrinsically variegated source of semantic information, so as to avoid the artificial and arbitrary merging of linguistic and visual streams; 2) contains spontaneous and therefore richer data, which are not induced by specific instructions or time constraints such as in the online games; 3) contains perceptual information that is derived from direct experience; 4) contains different types of semantic information (perceptual, conceptual, emotional, etc.) provided by the same individuals in relation to specific stimuli; 5) is based on dynamic, noisy, and constantly updated (Big) Data? As explained below, the answer can be found in the Flickr Distributional Tagspace (FDT), a distributional semantic space based on Flickr tags. Big Data meets cognitive science.

Flickr Distributional Tagspace

FDT is a distributional semantic space based on Flickr tags, i.e. linguistic labels associated with the images uploaded on Flickr. As a distributional semantic space, FDT delivers tables of proximities among words, built from the observation of tag covariance across large amounts of Flickr images: two tags that appear in similar pictures have similar meanings, even if the two tags do not appear together in the same pictures.

The Flickr environment

Flickr is a video/picture hosting service powered by Yahoo!. All the visual contents hosted on Flickr are user-contributed (they are personal pictures and videos provided by registered users) and spontaneously tagged by the users themselves. Tagging rights are restricted to self-tagging (and at best permission-based tagging, although in practice self-tagging is
most prevalent; see Marlow et al. 2006 for further documentation). Moreover, the Flickr interface mostly affords blind-tagging instead of suggested-tagging, i.e. tags are not based on a dictionary but freely chosen from an uncontrolled vocabulary, and thus might contain spelling mistakes, invented words, etc. Users can attribute a maximum of 75 tags to each picture, and this group of tags constitutes the image’s tagset. To the best of my knowledge there has been only one attempt to systematically categorize the tags attributed to pictures in Flickr. This classification, performed by Beaudoin (2007), encompasses 18 post-hoc created categories, which include syntactic property types (e.g. adjectives, verbs), semantic classes (human participants, living things other than humans, non-living things), places, events/activities (e.g. wedding, Christmas, holidays), ad-hoc created categories (such as photographic vocabulary, e.g. macro, Nikon), emotions, formal classifications such as terms written in any language other than English, and compound terms written as one word (e.g. mydog). Of all the 18 types of tags identified, Beaudoin reports that the most frequent are: i) geographical locations, ii) compounds, iii) inanimate things, iv) participants, and v) events. The motivations that stimulate the tagging process in Flickr, as well as in other digital environments, have been classified in different ways, the most popular being a macro-distinction between categorizers (users who employ shared high-level features for later browsing) and describers (users who accurately and precisely describe resources for later searching) (Körner et al. 2010). While Flickr users are homogeneously distributed across these two types, ESP users, for example, are almost all describers (Strohmaier, Körner, and Kern 2012). Other models suggest different categories of tagging motivations: Marlow et al. (2006) suggest a main distinction between organizational and social motivations; Ames and Naaman (2007) suggest a double distinction, between self vs social tagging and between organization-driven vs communication-driven tags; Heckner, Heilemann, and Wolff (2009) suggest a distinction
between personal information management vs resource sharing; Nov, Naaman, and Ye (2009) propose a wider range of categories for tagging motivation, which include enjoyment, commitment, self-development, and reputation. In general, all classifications suggest that Flickr users tend to attribute to their pictures a variety of tags that ranges beyond purely linguistic associations or purely visual features, suggesting that Flickr tags do include a wide variety of semantic information, which makes this environment an interesting corpus of dynamic, noisy, accessible, and spontaneous Big Data. Because all Flickr contents are user-contributed, they represent personal experiences lived by the user and then reported on the social network through photographs. Thus, each image can be considered a visual proxy for the actual experience lived by the photographer and captured in the picture. In fact, operations such as post-processing, image manipulation and editing seem to be used by Flickr users to improve the appearance of the pictures, rather than to create artificial scenes such as, for example, the conceptual images created ad hoc by advertisers and creative artists, where entities are artificially merged together and words are superimposed. However, at this stage this is a qualitative consideration, and would require further (quantitative) investigation. Although (as described above) the motivations for tagging personal pictures on Flickr may differ across the variety of users, each tag can be defined as a salient feature stimulated by the image, which captures an experience lived by the photographer. These features (these tags) are not simply concrete descriptors of the visual stimulus: they often denote cognitive operations, associated entities, and emotions experienced in that situation or triggered later on by the picture itself. Being an a-posteriori process, in fact, tagging also includes cognitive operations that range beyond the purely visual features, but that are still triggered by the image.


The Distributional Tagspace

This work builds upon an exploratory study proposed in Bolognesi (2014), where the idea of exploiting the user-generated tags from Flickr for creating semantic representations that encompass perceptual information was first introduced. The claim was investigated through a study based on an inherently perceptual domain: the words that denote primary and secondary colors. The covariance of the tags red, orange, yellow, green, blue, and purple across Flickr images was analyzed, and as a result the pairwise proximities between all six tags were plotted in a dendrogram. The same was done by retrieving the semantic information about the six color terms through two distributional models based on corpora of texts (LSA, Landauer, Dumais 1997; DM, Baroni, Lenci 2010). The cluster analysis based on Flickr tags showed a distribution of the colors that resembled the Newton color wheel (or the rainbow), which is also the distribution of the wavelengths perceived by the three types of cones that characterize the human eye, thanks to which we are sensitive to three different chromatic spectra. On the other hand, the two “blind” distributional models based on corpora of texts, and therefore on solely linguistic information, could not reproduce the same order: in the “blind” distributional models the three primary colors were closer to one another, and the tag green was in both cases the farthest one, probably due to the fact that the word green is highly polysemous. That first investigation, aimed at analyzing the distribution of color terms across Flickr images’ tags, showed that it is possible to capture rich semantic knowledge from the Flickr environment, and that this information is missed by (two) distributional models based on solely linguistic contexts.

Implementing FDT

The procedure for creating an FDT semantic space relies on the following steps, as it
was first illustrated in Bolognesi (2014). All the operations can be easily performed in the R environment for statistical analyses (for these studies, R version 2.14.2 was used), while the raw data (tagsets) can be downloaded from Flickr through the freely available Flickr API services¹.

¹ On demand, the author can release the raw data and the materials used for the studies described in the following sections.

1) Set up the pool of chosen concepts to be analyzed and represented in the distributional semantic space.

2) Download from Flickr® a corpus of tagsets that include each of the target concepts as a tag. The metadata must be downloaded through the API flickr.photos.search, whose documentation can be found on the Flickr website (https://www.flickr.com/services/api/explore/flickr.photos.search.html). In order to implement FDT the arguments api_key, tags, and extras need to be used. In api_key one should provide the Flickr-generated password used to sign each call; in tags one should provide the concepts to be mined; in extras one should indicate owner_name and tags. The reason for including the field owner_name is explained in point 4, while tags is needed to obtain the tagsets. There are several other optional arguments in flickr.photos.search, which can be used to further filter the results, for example by date of upload. The number of pictures to be downloaded for each target concept depends on their availability. As a rule of thumb, it is preferable to download roughly 100,000 pictures for each concept and then concatenate the obtained samples into one corpus (loaded into R as a dataframe). An informal investigation has shown that smaller amounts of pictures for each tag produce variable semantic representations of the target concept, while for more than 100,000 tagsets per concept the resulting semantic representation remains stable. Thus, in order to keep the computations fast, 100,000 tagsets per concept was adopted as the optimal value. The tagsets
download can be performed with the open source command-line utility implemented by Buratti (2011) for unsupervised downloads of metadata from flickr.com. This tool is hosted on code.google.com and can be freely downloaded.

3) After concatenating the tagsets into one dataframe, they should be cut at the 15th tag, so that the obtained corpus consists of tagsets of at most 15 tags each². This operation is done in order to keep only the most salient features that users attribute to a picture, which are arguably tagged first.

4) Subset (i.e. filter) the concatenated corpus, in order to drop the redundant tagsets that belong to the same user, and thus keep only the unique tagsets for each user (each owner name). This operation should be done to avoid biased frequencies among the tags’ co-occurrences, due to the fact that users often tag batches of pictures with the same tagset (copied and pasted). For example, on a sunny Sunday morning a user might take 100 pictures of a flower, upload all of them, and tag them with the same tags “sunny”, “Sunday”, “morning”, “flower”. In FDT only one of these 100 pictures taken by the same user is kept.

5) Another filtering of the corpus should be done, by dropping those tagsets where the concept to be analyzed appears after the first 3 tags³. This allows one to keep only those tagsets that describe pictures for which a target concept is very salient (and therefore is mentioned among the first 3 tags). Pictures described by tagsets where the target concept appears late are not considered to be representative of the given concept.

² An informal analysis conducted over a sample of 5 million pictures, ranking the number of tags attributed to each picture (i.e. how many pictures have 1, 2, 3 tags, etc.), showed that most pictures contain 1-15 tags. Beyond 15 tags, the curve indicating the number of pictures drops dramatically.

³ This number is chosen without a specific quantitative investigation: out of the 15 tags considered, the first 3 tags are considered to be the most salient, but a deeper psycholinguistic investigation could test whether the tagging speed actually decreases after the first 3 tags, suggesting a decrease in salience.

6) Build the matrix of co-occurrences, which displays the frequencies with which each target concept appears in the same picture with each related tag. This table will display
the target concepts on the rows and, on the columns, all of the other tags that co-occur with each of the target concepts across the downloaded tagsets. The raw frequencies of co-occurrence reported in the cells should then be turned into measures of association. The measure used for this distributional semantic space is an adaptation of Pointwise Mutual Information (Bouma, 2009), in which the joint co-occurrence of each tag pair is squared before dividing it by the product of the individual occurrences of the two tags. Then, the obtained value is normalized by multiplying the squared joint frequency by the sample size (N). This double operation (not very different from the one performed in Baroni and Lenci 2010) is done in order to limit the general tendency of mutual information to give weight to highly specific semantic collocates despite their low overall frequency. This measure of association is formalized as follows:

$$\mathrm{SPMI}(a, b) = \log \frac{f_{a,b}^{2} \cdot N}{f_a \cdot f_b}$$

where a and b are two tags, f stands for frequency of occurrence (joint occurrence of a with b in the numerator and individual occurrences of a and b in the denominator), and N is the corpus size. The obtained value approximates the likelihood of finding a target concept and each other tag appearing together in a tagset, taking into account their overall frequency in the corpus, the frequency of their co-appearance within the same tagsets, and the sample size. Negative values, as is commonly done, are raised to zero.

7) Turn the dataframe into a matrix, so that each row constitutes a concept’s vector, and calculate the pairwise cosines between rows. The cosine, a commonly used metric in distributional semantics, expresses the geometrical proximity between two vectors, which is interpreted as the semantic similarity between two concepts. The obtained table represents the multidimensional semantic space FDT.

All the steps illustrated in the procedure can be easily done with the basic R functions,
except for step 7, for which the package lsa is required. In fact, FDT is substantially similar to LSA; yet there are some crucial differences between FDT and LSA, summarized in Table 1:

                         | LSA | FDT
Context                  | Documents of text (the matrix of co-occurrences is word by document) | Tagsets (the matrix of co-occurrences is word by word)
Measure of association   | Typically tf-idf (term frequency–inverse document frequency) | SPMI
Dimensionality reduction | SVD (singular value decomposition), used because the matrix is sparse | None, the matrix is dense

Table 1: the three main differences between LSA and FDT, pertaining to context type (of the co-occurrence matrix), measure of association between an element and a context, and dimensionality reduction applied before the computation of the cosine.
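
Steps 3 to 7 of the procedure can be sketched as follows. This is a minimal Python illustration, not the R code used for the studies: the function names, the toy tagsets, the natural logarithm, and the choice of counting N as the number of retained tagsets are all assumptions made for the example. As a simplification, a tagset is kept when any target concept appears among its first 3 tags.

```python
import math
from collections import Counter, defaultdict


def prepare_tagsets(records, targets, max_tags=15, salient=3):
    """Apply steps 3-5 to (owner_name, ordered_tags) records."""
    seen, kept = set(), []
    for owner, tags in records:
        tagset = tuple(tags[:max_tags])            # step 3: cut at the 15th tag
        if (owner, tagset) in seen:                # step 4: one copy per owner
            continue
        seen.add((owner, tagset))
        if any(t in targets for t in tagset[:salient]):  # step 5: salient target
            kept.append(tagset)
    return kept


def cooccurrence(tagsets, targets, salient=3):
    """Step 6: joint counts (target by tag) and individual tag counts."""
    joint, freq = defaultdict(Counter), Counter()
    for tagset in tagsets:
        unique = set(tagset)
        freq.update(unique)
        for target in set(tagset[:salient]) & targets:
            for tag in unique - {target}:
                joint[target][tag] += 1
    return joint, freq


def spmi_vectors(joint, freq, n):
    """SPMI = log(f_ab^2 * N / (f_a * f_b)); negative values raised to zero."""
    return {
        target: {
            tag: max(0.0, math.log(f_ab ** 2 * n / (freq[target] * freq[tag])))
            for tag, f_ab in row.items()
        }
        for target, row in joint.items()
    }


def cosine(u, v):
    """Step 7: cosine between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Invented toy corpus, for illustration only.
records = [
    ("ann", ["dog", "park", "ball", "sunny"]),
    ("ann", ["dog", "park", "ball", "sunny"]),    # same owner, same tagset: dropped
    ("bob", ["cat", "sofa", "home"]),
    ("cy", ["puppy", "park", "ball"]),
    ("dee", ["holiday", "beach", "sea", "dog"]),  # target after 3rd tag: dropped
]
targets = {"dog", "puppy", "cat"}

kept = prepare_tagsets(records, targets)
joint, freq = cooccurrence(kept, targets)
vectors = spmi_vectors(joint, freq, n=len(kept))
print(cosine(vectors["dog"], vectors["puppy"]))
print(cosine(vectors["dog"], vectors["cat"]))
```

In this toy corpus dog and puppy never share a tagset, yet their vectors are close because both co-occur with park and ball, whereas dog and cat share no context at all.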

A cluster analysis can finally provide a deeper look into the data. In the studies described below, the data were analyzed in R through an agglomerative hierarchical clustering algorithm, Ward’s method (El-Hamdouchi, Willett, 1986; Ward Jr., 1963), also called minimum variance clustering (see explanation of this choice in 5.1). Ward’s method works on Euclidean distances (thus the cosines were transformed into Euclidean distances): it is a variance-minimizing approach which minimizes the sum of squared differences within all clusters and does not require the experimenter to set the number of clusters in advance. In hierarchical clustering each instance is initially considered a cluster by itself, and the instances are gradually grouped together according to the optimal value of an objective function, which in Ward’s method is the error sum of squares. Conversely, the commonly used k-means algorithms require the experimenter to set the number of clusters in which she wants the data
to be grouped. However, for observing the spontaneous emergence of consistent semantic classes from wild data, it seems preferable to avoid setting a fixed number of clusters in advance. In R it is possible to use agglomerative hierarchical clustering methods through the function hclust. An evaluation of the clustering solution, illustrated in the studies below, was obtained with the pvclust R package (Suzuki, Shimodaira 2006), which allows the assessment of the uncertainty in hierarchical cluster analysis. For each cluster in the hierarchical clustering, quantities called p-values are calculated via multiscale bootstrap resampling. The p-value of a cluster is a value between 0 and 1, which indicates how strongly the cluster is supported by the data⁴.
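
The agglomerative procedure described above can be illustrated with a small pure-Python sketch. It is not the hclust call used in the studies, and two points are assumptions: the text does not say how cosines were turned into Euclidean distances, so the sketch uses one common conversion, d = sqrt(2(1 - cos)), which is exact for vectors scaled to unit length; and the merge criterion is written directly as the increase in the error sum of squares. The function names and the toy vectors are invented for the example.

```python
import math
from itertools import combinations


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_to_euclidean(cos):
    """For unit vectors, ||u - v||^2 = 2 - 2 * cos(u, v)."""
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos)))


def ward_merge_cost(a, b):
    """Increase in the error sum of squares if clusters a and b are merged."""
    na, nb, dim = len(a), len(b), len(a[0])
    ca = [sum(p[i] for p in a) / na for i in range(dim)]
    cb = [sum(p[i] for p in b) / nb for i in range(dim)]
    squared = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return na * nb / (na + nb) * squared


def ward_clustering(vectors):
    """Return the merge history as (labels_a, labels_b, cost) tuples."""
    # Every instance starts as a cluster by itself.
    clusters = [((name,), [normalize(vec)]) for name, vec in vectors.items()]
    merges = []
    while len(clusters) > 1:
        # Pick the pair whose merge increases the error sum of squares least.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: ward_merge_cost(clusters[ij[0]][1], clusters[ij[1]][1]),
        )
        cost = ward_merge_cost(clusters[i][1], clusters[j][1])
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] + clusters[j][1])
        merges.append((clusters[i][0], clusters[j][0], cost))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Invented toy vectors, for illustration only.
toy = {"dog": [3, 1, 0], "cat": [2, 1, 0], "car": [0, 1, 3], "bus": [0, 1, 2]}
merges = ward_clustering(toy)
for left, right, cost in merges:
    print(left, right, round(cost, 4))
```

No number of clusters is fixed in advance: the full merge history is returned, and the grouping can be read off at any level of the hierarchy.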

Study One

The purpose of this study was to evaluate FDT semantic representations against those obtained from speaker-generated feature norms, and those obtained from linguistic analyses conducted on WordNet. The research questions approached by this task can be summarized as follows:

- To what extent do the semantic representations created by FDT correlate with the semantic representations based on human-generated features, and with those emerging from the computation of semantic relatedness in WordNet, using three different metrics?

⁴ Other validation methods, such as the popular purity and entropy measures obtained for example with the software CluTo, require the experimenter to set the number of clusters, an operation which was avoided here on purpose.

In order to achieve this, the semantic representations of a pool of concepts, analyzed with
FDT, were compared through a correlation study to those obtained from a database of human-generated semantic features, as well as to the similarities obtained by computing the pairwise proximities between words in WordNet (three different metrics).
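
As a minimal sketch of such a comparison, the similarities that one concept has to the same set of other concepts in two spaces can be correlated with the Pearson coefficient. The code below is an illustration in Python, not the analysis script used in the study; the helper names and all similarity values are invented for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / math.sqrt(var_x * var_y)


def correlate_concept(space_a, space_b, concept):
    """Correlate one concept's similarity lists across two semantic spaces."""
    shared = sorted(set(space_a[concept]) & set(space_b[concept]))
    return pearson(
        [space_a[concept][other] for other in shared],
        [space_b[concept][other] for other in shared],
    )


# Invented cosine values, for illustration only.
fdt_space = {"dog": {"cat": 0.62, "horse": 0.48, "car": 0.05, "airplane": 0.03}}
norms_space = {"dog": {"cat": 0.55, "horse": 0.40, "car": 0.02, "airplane": 0.01}}

r = correlate_concept(fdt_space, norms_space, "dog")
print(round(r, 3))
```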

Semantic spaces and concept similarities in FDT and in McRae’s feature norms

Given the encouraging outcomes of the exploratory study conducted on color terms, from which it emerged that FDT can capture perceptual information that is missed by other distributional models based on corpora of texts, a new investigation was conducted, aimed at comparing the distributional representations obtained from FDT with those derived from the database of McRae’s feature norms, a standard that has often been used for evaluating how well distributional models perform (e.g. Baroni, Evert, Lenci, 2008; Baroni, Lenci, 2008; Shaoul, Westbury, 2008). McRae’s feature norms is a database that covers 541 concrete, living and non-living basic-level concepts, which have been described by 725 subjects in a property generation task: given a verbal stimulus such as dolphin, participants had to list the features that they considered salient for defining that animal. The features produced in McRae’s database were then standardized and classified by property type, according to two different sets of categories: the taxonomy proposed by Cree and McRae (2003), and a modified version of the feature type taxonomy proposed in Wu and Barsalou (2009). Both taxonomies are reported in McRae et al. (2005). Moreover, McRae and colleagues released a distributional semantic space where the proximities between each concept and the other 540 are measured through the cosines between each pair of concept vectors, whose coordinates are the raw frequencies of co-occurrence between a concept and each produced feature. The resulting table is a square and
symmetric matrix displaying all the proximities between each pair of concepts, like a distance chart of cities. Each row (or column) of the matrix describes the distances (or better, the proximities) of a given concept against all the other concepts. In this study, a similar matrix was built with FDT by analyzing the concepts’ co-occurrences across Flickr tags, and then the lists of similarities characterizing the concepts in McRae’s feature norms were compared to the lists of similarities characterizing the concepts in FDT through the computation of the Pearson correlation coefficient. However, since not all the concepts listed in McRae’s feature norms are well represented in Flickr, only a subset of 168 concepts was selected because of their high frequency among Flickr® tags (> 100,000 photographs retrieved; e.g. airplane was considered, while accordion was dropped because in Flickr the amount of tagsets containing accordion among the first 3 tags was