Bleeding Edge Series: Multi-Doc Summarization

In our new series, Bleeding Edge, we look at breakthrough technologies within Text Analytics, Machine Learning and Natural Language Processing that are making the transition from academic lab rat to corporate cash cow. We’ll take a deep dive into the different technical approaches and review some of the commercial applications, and the companies starting…


Extracting Media from HTML

Over the last decade, web pages have developed rapidly from static pages with little interactive content into media-rich, dynamic ones. This can be attributed to new web technologies that make embedding multimedia content incredibly simple. For example, adding a video in HTML5 is as easy as wrapping it in <video> tags.…


A Benchmark Comparison Of Content Extraction From HTML Pages

Content extraction is the task of separating boilerplate, such as comments, navigation bars, social media links, ads, etc., from the main body of text of an article formatted as HTML. The main content typically accounts for only a small portion of a page’s source code (highlighted in red in the image below). Extraction is…

A Benchmark Comparison of Extractive Summarisation Systems

In this post, we report the results of the comparative evaluation of our Skim API against similar commercial and open-source extractive summarisation systems. Results indicate that our summarisation system consistently outperforms the analysed benchmarks in terms of ROUGE-N. The Information Age we are living in, fuelled by the advent of the World Wide Web…

