Each of these previous mapping efforts required extensive, highly specialized code that took hours and even days of computing time to render the final maps. Could the power of the modern cloud be used to process hundreds of billions of words of global news coverage to tease apart the underlying geographic contextualization of language and offer near-interactive mapping speed?

To explore this further, I recently tried mapping a full year of data from my open source GDELT Project to visualize the locations most closely associated with each word. Over that year, GDELT monitored a total of 126,101,464,912 words of worldwide news coverage across 260,022,952 news articles in 65 languages, totaling more than a terabyte of text. Its fulltext geocoding algorithms identified in this quarter billion news articles a total of 1,528,264,141 mentions of 741,899 distinct locations on earth, working out to around one location mention every 82 words.

For each of these 1.5 billion location mentions, the system built a histogram of the most common words in the English machine translation of each article appearing within 300 characters before and after each location reference.

In contrast to those previous analyses, which took hours to days to run and required immense specialized codebases, the final analysis was completed with just a single line of SQL, taking just 307.3 seconds. That works out to a raw processing speed of around 3.3 GB/s, aggregating hundreds of billions of records into the final analysis.

The final result is a massive database that lists the top 1,500 locations on earth that appeared most frequently with each of the top 200,000 most frequently used words over the past year. In other words, take any word, such as “putin” or “love,” and you’ll get back a map of the locations that were mentioned most commonly in context with that word across a quarter billion worldwide news articles from the past year.

The resulting map for the word “putin” shows the locations mentioned most commonly in the immediate vicinity of the word "putin" over the past year. Overall this map makes considerable sense, with Europe, especially Eastern Europe, featuring prominently in mentions of the Russian president over the past year. Syria and its impacts across the Middle East as well as North Korea are well represented, as are DC, New York and Silicon Valley in the US due to continued coverage of Russia’s alleged influence operations.

Geography of locations mentioned in close proximity to the word "love" in global news coverage, April 6, 2017 to …, as monitored by the GDELT Project. (Kalev Leetaru)

Of course, this map focuses on the literal word “love” rather than the general concept of expressing or experiencing love and does not include synonyms or variants like “loved” or “loving” or “loves.” Limitations of machine translation, insufficient context to properly geocode a given location mention, limited availability of news media in a given country or city and the overall biases of the global media all play a role in the resulting maps. Yet, these maps can actually offer powerful insights into those biases, showing us the world we see when we read a particular word and helping us better understand, for the first time at scale, the nuances and biases of the geographic lenses through which we see the world around us.

Of course, once 126 billion words of text have been codified into 1.5 billion mappable geographic coordinates, distilling the media’s qualitative expression into the quantitative numbers required by machines, all sorts of analyses become possible. Filtering by language or country could reveal the differences in how the media of the world contextualize given words. One could easily imagine extending from single words to bigrams, trigrams and general ngrams that would allow for exploring the geography of phrases, not just individual words that can easily have competing contexts. Spatial correlation could be especially interesting.
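
To make the word-to-location counting described above more concrete, here is a minimal Python sketch of the general idea: for every location mention, collect the words within 300 characters on either side, tally how often each word appears near each location, then read off the top locations for a given word. This is not the SQL used in the actual analysis, which is not reproduced here. The function names, the input shape (article text plus a list of character offsets and location names), the regex tokenizer, and measuring the window from the start of each mention are all assumptions made purely for illustration.

```python
import re
from collections import Counter, defaultdict

WINDOW = 300  # characters before and after each mention, as described in the article
WORD_RE = re.compile(r"[a-z']+")  # assumed tokenizer; the real pipeline's is not specified


def cooccurrence_counts(articles, window=WINDOW):
    """Tally how often each word appears near each location.

    `articles` is an iterable of (text, mentions) pairs, where `mentions`
    is a list of (char_offset, location_name) tuples. This input shape is
    hypothetical. Returns a mapping: word -> Counter of location names.
    """
    counts = defaultdict(Counter)
    for text, mentions in articles:
        for offset, location in mentions:
            # Window is measured from the start of the mention (an assumption),
            # so the location's own name is counted among the nearby words.
            start = max(0, offset - window)
            end = offset + window
            for word in WORD_RE.findall(text[start:end].lower()):
                counts[word][location] += 1
    return counts


def top_locations(counts, word, n=1500):
    """Return up to `n` (location, count) pairs most associated with `word`."""
    return counts.get(word.lower(), Counter()).most_common(n)


if __name__ == "__main__":
    t1 = "Putin spoke in Moscow today."
    t2 = "Putin praised Moscow while crowds in Paris spoke of love."
    articles = [
        (t1, [(t1.index("Moscow"), "Moscow")]),
        (t2, [(t2.index("Moscow"), "Moscow"), (t2.index("Paris"), "Paris")]),
    ]
    counts = cooccurrence_counts(articles)
    print(top_locations(counts, "putin", n=5))
```

At the scale described in the article, this kind of aggregation would run inside a distributed query engine rather than in an in-memory dictionary; the sketch only illustrates the counting logic.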