These notes were made while looking at several semantic frameworks, the general concept of semantics, and the Redland framework in particular.
The basic workflow is:
- A structured or unstructured document is analyzed.
- The result of the analysis is a formatted data tree.
- Two data trees are then compared and associated with each other using semantic logic. The result is a triple.
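The last step of the workflow can be sketched in a few lines of Python. This is only a minimal illustration, not how Redland does it: the two data trees are plain dictionaries, the `associate` helper and the example URLs are hypothetical, and the "semantic logic" is reduced to a simple name comparison. The predicate is the Dublin Core `creator` term.

```python
from collections import namedtuple

# A triple is a (subject, predicate, object) statement.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Two hypothetical data trees, as an analysis step might produce them.
page = {
    "id": "http://example.org/notes",
    "title": "Semantic Notes",
    "author_name": "Jane Doe",
}
person = {
    "id": "http://example.org/people/jane",
    "name": "Jane Doe",
}

def associate(doc, candidate):
    """Return a triple linking doc to candidate if the names match, else None."""
    if doc.get("author_name") == candidate.get("name"):
        return Triple(doc["id"], "http://purl.org/dc/terms/creator", candidate["id"])
    return None

triple = associate(page, person)
print(triple)
```

A real framework would compare far richer structures, but the output has the same shape: one subject, one predicate, one object.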
Extracting Content from an HTML Page or Tag Soup
Redland allows you to extract an information tree from an unstructured data format.
Given an HTML page, whether valid or tag soup, the parser extracts all types of defined data.
The extractor is designed for eRDF, RDFa, the core 15-20 microformats, OpenID hooks, basic Dublin Core, and other formats (and mappings).
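To give a rough idea of what such an extraction involves, here is a small sketch using Python's standard `html.parser` module, which tolerates unclosed tags. It is not the Redland or ARC2 extractor: the `PropertyExtractor` class is hypothetical and only picks up RDFa-style `property` attributes, taking the value either from a `content` attribute or from the text that follows.

```python
from html.parser import HTMLParser

class PropertyExtractor(HTMLParser):
    """Collects (property, value) pairs from RDFa-style `property` attributes."""

    def __init__(self):
        super().__init__()
        self.pairs = []       # extracted (property, value) pairs
        self._pending = None  # property still waiting for its text value

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("property")
        if prop is None:
            return
        if "content" in attrs:
            # The value is given inline, e.g. <meta property=... content=...>
            self.pairs.append((prop, attrs["content"]))
        else:
            # The value is the next piece of non-empty text
            self._pending = prop

    def handle_data(self, data):
        text = data.strip()
        if self._pending and text:
            self.pairs.append((self._pending, text))
            self._pending = None

# Tag soup: the <p> elements are never closed.
html = (
    '<div><p property="dc:title">Semantic Notes'
    '<p property="dc:creator">Jane Doe'
    '<meta property="dc:language" content="en"></div>'
)

extractor = PropertyExtractor()
extractor.feed(html)
extractor.close()
print(extractor.pairs)
```

The pairs still lack a subject; a full extractor would also resolve prefixes such as `dc:` to namespace URIs and attach each pair to the resource it describes.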
Many formats don’t necessarily map to a single RDF vocabulary: a social graph aggregator might prefer FOAF as the target format, while an address book application may be based on a certain vCard-RDF mapping. Additionally, only a subset of all possible triples in a document will be needed in a given context.
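Both points, choosing a target vocabulary and keeping only the triples that matter, can be sketched as follows. The namespace URIs are the published ones for FOAF, vCard and Dublin Core; the sample triples, the `VCARD_TO_FOAF` table and the `to_foaf` helper are assumptions made up for this illustration, not a standard mapping.

```python
# Namespace URIs of the vocabularies involved.
FOAF = "http://xmlns.com/foaf/0.1/"
VCARD = "http://www.w3.org/2006/vcard/ns#"
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical triples as an extractor might return them.
extracted = [
    ("_:jane", VCARD + "fn", "Jane Doe"),
    ("_:jane", VCARD + "nickname", "jd"),
    ("_:doc", DC + "title", "Semantic Notes"),
]

# Illustrative vCard-to-FOAF mapping (an assumption, not a standard).
VCARD_TO_FOAF = {
    VCARD + "fn": FOAF + "name",
    VCARD + "nickname": FOAF + "nick",
}

def to_foaf(triples):
    """Keep only triples with a known mapping and rewrite their predicate."""
    return [(s, VCARD_TO_FOAF[p], o) for s, p, o in triples if p in VCARD_TO_FOAF]

foaf_triples = to_foaf(extracted)
for triple in foaf_triples:
    print(triple)
```

A social graph aggregator would use a table like this one, while an address book application could keep the vCard predicates untouched and drop everything else instead.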
ARC2 was originally designed for parsing and serializing RDF/XML files. It later evolved into a more complete framework with storage and query functionality.
Even though it is no longer actively developed, ARC2 is still one of the most-installed RDF libraries.
See also:
https://github.com/semsol/arc2/wiki/Extracting-RDF-from-HTML
Some interesting resources:
http://librdf.org/using.html
http://www10.org/cdrom/papers/490/
http://librdf.org/
http://journal.dajobe.org/journal/