Automatic extraction of semi-structured Web content: Case study of Brazilian football

Authors

  • Alexandre S. de Melo, Departamento de Ciência da Computação, Universidade Federal de Minas Gerais (UFMG)
  • Hendrik T. Macedo, Departamento de Computação, Universidade Federal de Sergipe (UFS)

Keywords:

Information Extraction, Production Rules, JEOPS, Wrapper, Crawler

Abstract

Information extraction techniques automatically generate a structured representation from unstructured or semi-structured content. Structured information enables or facilitates further processing by third-party Web applications. This work describes the implementation of a domain-oriented information extraction system. The system automatically converts semi-structured Web content into structured content by means of object-oriented production rules that instantiate the provided domain-specific classes. These rules are implemented in JEOPS, a Java-based first-order forward-chaining inference engine. To show the feasibility of the proposal, we have fully specified classes that model the Brazilian Soccer Championship. Taking a Web site address as input, the system uses the facts and rules defined in its knowledge base to identify related links, find the championship classification table, and extract the table data. As a result, it automatically populates instances of the domain classes.
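
The last two steps of that pipeline (finding the classification table and turning its rows into domain-class instances) can be illustrated with a minimal sketch. It is not the authors' implementation: the paper's rules are written in JEOPS, whereas this is a plain Python stand-in using only the standard library. The `TeamStanding` class, the expected column names, and the sample HTML are all hypothetical, and the link-discovery step is omitted.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class TeamStanding:
    """Hypothetical domain class: one row of a championship table."""
    position: int
    team: str
    points: int


class TableCollector(HTMLParser):
    """Collects every <table> as a list of rows of cell texts.

    Simplification: nested tables are not handled.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table> seen
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


# Assumed header of the classification table (not taken from the paper).
EXPECTED_HEADER = ("pos", "team", "pts")


def is_classification_table(rows):
    """Rule condition: the first row matches the expected header."""
    return bool(rows) and tuple(c.lower() for c in rows[0]) == EXPECTED_HEADER


def extract_standings(html):
    """Rule action: instantiate TeamStanding for each data row."""
    collector = TableCollector()
    collector.feed(html)
    for rows in collector.tables:
        if is_classification_table(rows):
            return [TeamStanding(int(p), t, int(pts)) for p, t, pts in rows[1:]]
    return []


# Made-up sample page: a navigation table followed by a standings table.
SAMPLE = """
<table><tr><td>Menu</td></tr></table>
<table>
  <tr><th>Pos</th><th>Team</th><th>Pts</th></tr>
  <tr><td>1</td><td>Team A</td><td>10</td></tr>
  <tr><td>2</td><td>Team B</td><td>7</td></tr>
</table>
"""

standings = extract_standings(SAMPLE)
```

Separating the condition (`is_classification_table`) from the action (`extract_standings`) loosely mirrors the condition/action shape of a production rule, although a real inference engine would match many such rules against a working memory of facts rather than call them in a fixed order.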

How to Cite

de Melo, A. S., & Macedo, H. T. (2011). Automatic extraction of semi-structured Web content: Case study of Brazilian football. Scientia Plena, 5(8). Retrieved from https://www.scientiaplena.org.br/sp/article/view/640