Web content extraction is a key technology for enabling an array of applications aimed at understanding the web. While automated web extraction has been studied extensively, existing work often focuses on extracting structured data that appears multiple times on a single webpage, like product catalogs. This project aims to extract less structured web content, like news articles, that appears only once in a noisy webpage. Our approach classifies text blocks using a mixture of visual and language-independent features. In addition, we devise a pipeline that labels datapoints automatically through clustering: each cluster is scored by its relevance to the webpage description extracted from the meta tags, and the datapoints in the best-scoring cluster are selected as positive training examples.
The following is an example webpage with the article title and content correctly labeled:
In the data collection phase,
- We use PhantomJS to load the original webpage in a headless WebKit browser, restoring the layout intended for a human audience.
- For each text block, a set of features is extracted, including the size and position of the block, the text it contains, as well as 300 different CSS properties.
- For each site, we collect at least two sample pages using this method.
Below is a screenshot rendered by PhantomJS with all text blocks identified.
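As a rough illustration of the per-block features described above, the following sketch turns text block records into numeric vectors. The record layout (`x`, `y`, `width`, `height`, `text`, `css`), the helper names, and the one-hot encoding of CSS values are all assumptions made for this example, not the project's actual data format. Only the text length is used, rather than the words themselves, to keep the features language independent.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical record for one text block, as a headless browser might report it.
Block = Dict[str, Any]

NUMERIC_KEYS = ("x", "y", "width", "height")  # block geometry in pixels


def build_vocabulary(blocks: List[Block]) -> List[Tuple[str, str]]:
    """Collect every (property, value) pair seen in the CSS of any block."""
    pairs = set()
    for block in blocks:
        pairs.update(block["css"].items())
    return sorted(pairs)


def to_vector(block: Block, vocabulary: List[Tuple[str, str]]) -> List[float]:
    """Encode one block as geometry + text length + one-hot CSS values."""
    geometry = [float(block[key]) for key in NUMERIC_KEYS]
    text_length = [float(len(block["text"]))]
    css = block["css"]
    one_hot = [1.0 if css.get(prop) == value else 0.0 for prop, value in vocabulary]
    return geometry + text_length + one_hot


# Two made-up blocks: a headline and a body paragraph.
blocks = [
    {"x": 40, "y": 120, "width": 640, "height": 48, "text": "Example headline",
     "css": {"font-size": "32px", "font-weight": "700"}},
    {"x": 40, "y": 200, "width": 640, "height": 900, "text": "Body text " * 50,
     "css": {"font-size": "16px", "font-weight": "400"}},
]
vocabulary = build_vocabulary(blocks)
vectors = [to_vector(block, vocabulary) for block in blocks]
```

Building the vocabulary from all blocks of a site keeps the vectors the same length, so blocks from different pages can be compared directly.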
In the training and testing phase,
- We use DBSCAN to cluster similar text blocks across all pages for a given site. The resulting clusters nicely separate each class of page elements. For example, all article titles found on the same site will fall into the same cluster because they have a similar visual appearance thanks to the site template.
- We label all clusters using hints from the meta tags on the webpage or from an already labeled dataset. This can also be done manually for higher accuracy.
- We train an SVM with a linear kernel using 4-fold cross-validation. The results show that our approach achieves near-perfect precision and recall.
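The clustering and automatic labeling steps above can be sketched as follows. This is a minimal pure-Python DBSCAN written for illustration; a library implementation such as scikit-learn's `DBSCAN` would normally be preferable. The relevance score used here (the fraction of a block's tokens that also occur in the meta description of its page) is an assumption standing in for whatever measure the project uses, and the feature values, texts, `eps`, and `min_samples` are made up.

```python
import math
import re
from collections import defaultdict

NOISE = -1


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN: returns one cluster id per point, NOISE for outliers."""
    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels


def tokens(text):
    return set(re.findall(r"\w+", text.lower()))


def best_cluster(labels, texts, descriptions):
    """Pick the cluster whose blocks overlap most with their page descriptions.

    descriptions[i] is the meta description of the page that block i came from.
    Returns None if every block is noise.
    """
    overlaps = defaultdict(list)
    for label, text, description in zip(labels, texts, descriptions):
        if label == NOISE:
            continue
        found = tokens(text)
        overlaps[label].append(len(found & tokens(description)) / max(len(found), 1))
    if not overlaps:
        return None
    return max(overlaps, key=lambda c: sum(overlaps[c]) / len(overlaps[c]))


# Made-up (font size, width) features for blocks from two pages of one site.
points = [(32.0, 640.0), (32.0, 642.0), (14.0, 120.0), (14.0, 121.0), (10.0, 900.0)]
texts = ["Storm hits the coast", "Markets rally after rate cut",
         "Home World Sports", "Home World Sports", "Copyright Example News"]
page0 = "A storm hits the coast overnight"
page1 = "Markets rally as the central bank announces a rate cut"
descriptions = [page0, page1, page0, page1, page0]

labels = dbscan(points, eps=3.0, min_samples=2)
best = best_cluster(labels, texts, descriptions)
positives = [text for label, text in zip(labels, texts) if label == best]
```

In this toy input the two headlines share a visual style and land in one cluster, the navigation blocks form another, and the lone footer block is treated as noise. The headline cluster scores highest, so its blocks would become the positive training examples for the classifier.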
Our final report is located here: https://github.com/ziyan/spider/raw/master/docs/final/final.pdf
The project is open-sourced under the MIT license: https://github.com/ziyan/spider