Deep Neural Networks for Web Page Information Extraction

Web wrappers are systems for extracting structured information from web pages. Current wrappers must be adapted to a particular website template before they can start the extraction process. In this work we present a new method that uses convolutional neural networks to learn a wrapper capable of extracting information from previously unseen templates. This wrapper therefore needs no site-specific initialization and can extract information from a single web page. We also propose a method for spatial text encoding, which allows us to encode the visual and textual content of a web page into a single neural net. Initial experiments with product information extraction showed very promising results and suggest that this approach can lead to a general site-independent web wrapper.
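
The abstract does not spell out how the spatial text encoding works, so the following is only an illustrative sketch of one way text could be laid out spatially for a convolutional network, not the paper's implementation. It assumes each text node comes with a pixel bounding box, hashes every word to one of a fixed number of channels, and marks that channel in each coarse grid cell the box covers. The function name `text_map`, the grid size, the channel count, and the use of CRC32 as the hash are all assumptions made for this example.

```python
import math
import zlib


def text_map(boxes, page_w, page_h, grid_w=8, grid_h=8, channels=16):
    """Build a grid_h x grid_w x channels nested list from text boxes.

    boxes: iterable of (text, (x0, y0, x1, y1)) in page pixels.
    Hypothetical helper for illustration; not taken from the paper.
    """
    grid = [[[0.0] * channels for _ in range(grid_w)] for _ in range(grid_h)]
    for text, (x0, y0, x1, y1) in boxes:
        # Hashing trick: each lowercased word selects one channel.
        active = {
            zlib.crc32(word.lower().encode("utf-8")) % channels
            for word in text.split()
        }
        # Grid cells covered by the box (end index exclusive, clamped).
        col0 = min(grid_w - 1, max(0, int(x0 * grid_w / page_w)))
        row0 = min(grid_h - 1, max(0, int(y0 * grid_h / page_h)))
        col1 = min(grid_w, max(col0 + 1, math.ceil(x1 * grid_w / page_w)))
        row1 = min(grid_h, max(row0 + 1, math.ceil(y1 * grid_h / page_h)))
        for row in range(row0, row1):
            for col in range(col0, col1):
                for ch in active:
                    grid[row][col][ch] = 1.0  # binary presence of the word
    return grid


# Two made-up text nodes on an 800 x 800 pixel page.
example = text_map(
    [("Price $19.99", (0, 0, 100, 100)), ("Acme Widget", (400, 400, 800, 800))],
    page_w=800,
    page_h=800,
)
```

A tensor of this shape could be stacked with a downsampled screenshot so that one network sees both where text sits and roughly what it says. That pairing is likewise an assumption about how such an encoding might be used.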

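Results from the Paper

| Task                     | Dataset | Model    | Metric Name                 | Metric Value | Global Rank |
|--------------------------|---------|----------|-----------------------------|--------------|-------------|
| Webpage Object Detection | CoVA    | TextMaps | Cross Domain Price Accuracy | 78.1         | # 3         |
| Webpage Object Detection | CoVA    | TextMaps | Cross Domain Title Accuracy | 91.5         | # 3         |
| Webpage Object Detection | CoVA    | TextMaps | Cross Domain Image Accuracy | 93.2         | # 3         |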
