Target / Overview
What are the hottest topics around the fascinating technology of autonomous driving? A large number of companies are investing in this technology, which will be essential for the future transportation of people and goods. For this reason it is currently under heavy development. This analysis is based on 1,500 news articles on the topic. I used Bing News and NewsIsFree as sources for English-language news. The news sources range from http://www.huffingtonpost.com to http://www.telegraph.co.uk.
As you can see, the plot shows the big players that are investing in the self-driving car sector: Google aka Waymo, Nvidia, Intel, which is just entering the market, and BMW, which started a cooperation with Delphi to enhance its technology. A very interesting one is the cluster in the top left corner, as you can see in the highlighted section below.
It’s about the dispute between Waymo and Uber over engineer Anthony Levandowski, who is accused of reusing Google’s technology for his own start-up, which was then bought by Uber. “In a lawsuit filed last month in San Francisco, Google’s autonomous vehicle spinoff Waymo accuses former engineer Anthony Levandowski of stealing 14,000 pages of intellectual property and trade secrets, then using them to launch his robotruck startup, Otto. Just four months later, Uber acquired Otto for a reported $680 million—and that, Waymo alleges, is how the ride-hailing giant started using Google’s vital circuit board and sensor setup in its own self-driving cars.” See https://www.wired.com/2017/03/simple-theory-ubers-waymo-mess-just-sloppy/ for the full article.
The following processing steps are involved:
- Crawling news portals for a given query
- Fetching and cleaning all news pages. I preferred a boilerplate-removal approach over a simple clean-content extraction, since it delivers less but more relevant, ‘on-topic’ content
- Tagging with a cascaded approach (more on that below)
- Trying different ways of transforming the topics into a graph. The co-occurrence approach worked best
- Importing and plotting using Gephi
The tagging consists of three steps. The first step uses a positive phrase lexicon, which is extracted automatically from a large amount of news. This delivers a large number of valid and commonly used phrases, e.g. self-driving cars. Secondly, I’m using a chunk parser. A chunk parser uses language-specific morphology to extract meaningful multi-token terms from a given text. Since the chunking delivers quite a number of phrases that are prefixed with some kind of stopword, e.g. “the engine” or “our target”, some clean-up is needed. In the third and last step I’m picking up single-token terms that do not intersect with any of the previous tags.
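The three tagging steps can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the tiny `PHRASE_LEXICON` and `STOPWORDS` sets are made up, all function names are hypothetical, and the `chunk_spans` argument merely stands in for the output of a real chunk parser, which is not reproduced here.

```python
import re

# Hypothetical, tiny resources; the real lexicon is mined from a large news corpus.
PHRASE_LEXICON = {("self-driving", "cars"), ("trade", "secrets")}
STOPWORDS = {"the", "our", "a", "an", "of", "in", "is", "are", "for", "on", "its"}


def tokenize(text):
    """Lowercase word tokenizer that keeps hyphenated words together."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def lexicon_spans(tokens, lexicon, max_len=4):
    """Step 1: greedy longest match of known phrases; returns (start, end) spans."""
    spans, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in lexicon:
                spans.append((i, i + n))
                i += n
                break
        else:
            # No phrase starts at this position; move on by one token.
            i += 1
    return spans


def strip_leading_stopwords(tokens, span):
    """Step 2 clean-up: drop stopwords at the start of a chunk, e.g. 'the engine'."""
    start, end = span
    while start < end and tokens[start] in STOPWORDS:
        start += 1
    return (start, end)


def overlaps(span, taken):
    """True if span intersects any span that has already been tagged."""
    return any(span[0] < e and s < span[1] for s, e in taken)


def cascade_tag(text, chunk_spans=()):
    """Run the three steps in order; earlier steps win over later ones."""
    tokens = tokenize(text)
    taken = lexicon_spans(tokens, PHRASE_LEXICON)
    for span in chunk_spans:  # assumed output of a chunk parser, as token spans
        span = strip_leading_stopwords(tokens, span)
        if span[0] < span[1] and not overlaps(span, taken):
            taken.append(span)
    # Step 3: single tokens that do not intersect any earlier tag.
    for i, tok in enumerate(tokens):
        if tok not in STOPWORDS and not overlaps((i, i + 1), taken):
            taken.append((i, i + 1))
    return [" ".join(tokens[s:e]) for s, e in sorted(taken)]
```

For example, `cascade_tag("The engine of self-driving cars is new", chunk_spans=[(0, 2)])` keeps the lexicon phrase, reduces the chunk "the engine" to "engine", and adds "new" as a single-token term.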
The scoring is based on TF-IDF, an approach that is well known in information retrieval and tagging.
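As a reminder of how TF-IDF works, here is a minimal sketch in plain Python. It assumes the simplest textbook variant (relative term frequency times the logarithm of the inverse document frequency); the exact weighting used for this analysis is not specified here and may differ. The function name `tf_idf` is only illustrative.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of term lists, one per news article.

    Returns one {term: score} dict per article.
    """
    n_docs = len(docs)
    # Document frequency: in how many articles does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            # Relative frequency in this article times log inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores
```

A term that occurs in every article gets a score of zero in this variant, while a term that is frequent in one article but rare elsewhere scores high, which is what makes it a useful tag.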
The clustering of the terms is simple co-occurrence: the more often two terms appear together in one news article, the more similar they are. After a preselection of the 100 most important topics, I’m creating a co-occurrence matrix, which creates a connection (an edge) between each top term and each topic that appears in the same news article. This way a natural clustering is created. Using the ForceAtlas2 algorithm to lay out the nodes creates nice and meaningful clusters. This algorithm uses a kind of attraction and repulsion logic for placing the given nodes and edges. Things that are often mentioned together will be closer in the Gephi plot.
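The co-occurrence counting could look roughly like the sketch below. It is a simplification under stated assumptions: the top terms are preselected by document frequency rather than by the TF-IDF score, edges are only created between top terms, and the function names are hypothetical. The edge list is written as CSV with `Source`, `Target` and `Weight` columns, which, as far as I know, Gephi's spreadsheet import understands.

```python
import csv
import io
from collections import Counter
from itertools import combinations


def cooccurrence_edges(docs, top_n=100):
    """Count how often two of the top_n terms appear in the same article."""
    # Preselection by document frequency (an assumption made for this sketch).
    df = Counter(term for doc in docs for term in set(doc))
    top_terms = {term for term, _ in df.most_common(top_n)}
    edges = Counter()
    for doc in docs:
        # Sorting gives every undirected edge one canonical (a, b) key.
        present = sorted(set(doc) & top_terms)
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return edges


def write_edge_list(edges, handle):
    """Write a weighted, undirected edge list as CSV."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target", "Weight"])
    for (a, b), weight in sorted(edges.items()):
        writer.writerow([a, b, weight])


# Usage with three toy "articles" that are already reduced to their tags.
docs = [["waymo", "uber", "otto"], ["waymo", "uber"], ["nvidia", "intel"]]
buffer = io.StringIO()
write_edge_list(cooccurrence_edges(docs), buffer)
print(buffer.getvalue())
```

The edge weight is what a force-directed layout then uses as attraction: the heavier the edge between two terms, the closer they end up in the plot.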
Currently I’m working on different ways to visualize the topics extracted from the news. New results will be posted on this page.