head	1.2;
access;
symbols;
locks; strict;
comment	@# @;
expand	@o@;


1.2
date	2007.07.14.13.56.46;	author BenSzekely;	state Exp;
branches;
next	1.1;

1.1
date	2007.07.14.00.13.13;	author BenSzekely;	state Exp;
branches;
next	;


desc
@none
@


1.2
log
@none
@
text
@%META:TOPICINFO{author="BenSzekely" date="1184421406" format="1.1" reprev="1.2" version="1.2"}%
%META:TOPICPARENT{name="WebHome"}%

For the Nala Demo design project see NalaDesign
@


1.1
log
@none
@
text
@d1 1
a1 1
%META:TOPICINFO{author="BenSzekely" date="1184371993" format="1.1" version="1.1"}%
d4 1
a4 647
In Section 0 we begin with a preliminary discussion of the components we will use. We also discuss why we are not using alternatives such as
Piggy Bank. In Section 1 we give a basic description of how the system will work, end-to-end, with architecture diagrams included. In this section, we will
detail the design decisions and implementation techniques as much as possible. However, to avoid disrupting the flow, we will include a more
complete discussion of certain decisions and systems in Section 2. In Section 3, we include technical descriptions of the necessary
components. In Section 4 we describe the development plan along with rough implementation estimates.

Section 0 - Preliminaries

For lack of a better name, I initially call the system 'Nala' after the baby South American Caique parrot that is
living in my study for a couple of weeks. She can't talk yet but is really good at debugging Javascript. (She just pulls
the keys off my Thinkpad keyboard with incredible speed if I turn my head for a second.)

'Queso' refers to the web application framework running atop 'Boca', the RDF database. The main features we will
use in Queso are the Atom Publishing Protocol and SPARQL endpoints.

Before delving into the system design, we briefly discuss some existing tools and why we have decided to use or not use them.

Piggy Bank

On the surface it appears that Piggy Bank does almost everything we need of our application. It is a sophisticated system
that can detect RDF in pages via links and scraping. It stores data locally in Sesame, but can share data in remote semantic banks.
Both the local and remote semantic banks can be queried and browsed using faceted browsing and query. It uses the Longwell faceted
browser and, from what we can tell, it is heavily intertwined with the Longwell framework.

Pros
- nice faceted browsing and search
- provides many of the required features of the project as is
- map/timeline
- is a Firefox plugin
- extensible screen scraping framework
- has some form of remote storage
- it could almost be used as is

Cons
- large, complicated, not very well documented code base
- backend store is not documented or easily accessed via API
- no built-in extensibility points for views or backends
- though it might be passable as is, we would only get it 'as-is',
as extending and hooking into the large codebase might be
too difficult in the time of the contract
- a large part of the savings in using Piggy Bank would be for any UI
work they have done, but the UI does not satisfy the requirements of the
application as is
- we'd have to modify code in the backend/Java, increasing the dev cost
by a lot
- even if we could hook into their backend, or modify their front end
to use our backend, maintaining this code would be difficult
- it does not appear that development on Piggy Bank is continuing aside
from small bug fixes

Queso, Boca

Boca is an open source RDF store developed by my former team at IBM. It has enterprise features such as revision tracking, access control,
transactions, SPARQL query, and a named graph data model and API. Atop Boca runs Queso, a Semantic Web application development platform allowing
simple Ajax/REST access to Boca named graphs and the SPARQL engine from Javascript applications. Using these systems, we can build our
application rapidly because we won't have to write any server-side code, and I'm quite familiar with the codebase should any
bug fixes or new features be required. Another great advantage of using Boca is that it has a very sophisticated programming model and
Web Service based access API. If our application takes off, and people are using it to generate large amounts of RDF, it will be much
easier to leverage that data using Boca.

Client Repository/Views

By not using Piggy Bank, we are on our own for implementing a client-side repository. The Client Repository, or CR, will be our very
simple Javascript-based local RDF store. At first, this store will reside in memory, but eventually we should be able to serialize it
to disk using JSON. We are also left to build all our own views. However, we think that with a sufficiently simple and useful abstraction
in the CR we can build out the views defined in the requirements section. Graphically the views may not be incredibly sophisticated,
but they should get the job done.


Section 1 - System Design

Section 1.1 - Data collection

In this section we describe in detail how RDF data finds its way into the system. The named graph is the atomic unit
of storage and management in our application. This works well because the named graph is the atomic unit in Boca and Queso,
significantly simplifying our design and implementation.

- When a webpage loads, the screen scraping framework is invoked. Based on the URL and possibly other
factors, a set of screen scrapers is chosen to run on the page. The scrapers fall into two categories:

1 - code that looks for links to RDF sources
2 - code that produces triples directly from the page

We'll discuss the details of how scrapers are registered and stored further on.

- We arbitrarily invoke the link scrapers (1) in some order, but maintain a collective set of RDF URLs that
they find, since some scrapers might produce the same URL(s). We then fetch each URL with an XMLHttpRequest.
(A rough sketch of this step follows.)

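As a rough sketch, not a commitment to an API, the dispatch might look like this in Javascript. The scraper objects with a scrape(doc) method are the hypothetical interface sketched in Section 1.5, and collectRdf is an illustrative callback.

  // Run every registered link scraper on the page, deduplicate the URLs they
  // return, and fetch each one with an XMLHttpRequest.
  function runLinkScrapers(linkScrapers, doc, collectRdf) {
    var seen = {};
    for (var i = 0; i < linkScrapers.length; i++) {
      var urls = linkScrapers[i].scrape(doc);
      for (var j = 0; j < urls.length; j++) {
        seen[urls[j]] = true;
      }
    }
    for (var url in seen) {
      fetchRdf(url, collectRdf);
    }
  }

  function fetchRdf(url, collectRdf) {
    var xhr = new XMLHttpRequest();
    xhr.open("GET", url, true);
    xhr.onreadystatechange = function () {
      if (xhr.readyState == 4 && xhr.status == 200) {
        collectRdf(url, xhr.responseText);   // raw RDF, to be wrapped in an Atom Entry below
      }
    };
    xhr.send(null);
  }
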
- To save work on the client, and because the server will validate anyway, we assume the links contain valid RDF. For the RDF responseText
from each XHR we create an Atom Entry with the RDF as the content, with content type RDF/XML or whichever RDF content type the link
contained. (We may have to modify the Queso server to accept additional RDF content types.) The effect of posting the Atom Entry to the
Queso server is that each RDF source from the page will be automatically inserted into its own named graph. Either by adding extra metadata
to the Atom Entry XML (which will get transformed on the server) or by inserting extra triples into the RDF payload, we can add additional
triples to the named graph. These might include the source URL, the scraper id, the current page that invoked the scraper and anything
else we deem necessary.

- Because we store all the provenance information for each graph (the URL, the scraper used, the date, etc.), we can use SPARQL
ASK queries to determine whether a page should be rescraped. (An example of such an ASK is sketched below.)

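For illustration only: assuming we attach a provenance triple along the lines of a nala:sourcePage predicate (a hypothetical name; the real vocabulary is whatever we decide to add to the Atom Entry or RDF payload), the rescrape check could be built on the client as:

  // Hypothetical provenance check: has this page already been scraped?
  // The nala:sourcePage predicate and prefix are placeholders for the
  // provenance triples we actually attach to each named graph.
  function buildRescrapeCheck(pageUrl) {
    return "PREFIX nala: <http://example.org/nala#> " +
           "ASK { GRAPH ?g { ?g nala:sourcePage <" + pageUrl + "> } }";
  }
  // The SPARQL JSON response to an ASK is a single boolean, so the client
  // can decide directly whether the page needs rescraping.
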
- For the purposes of walking the walk, I think we should be able to handle LSIDs at this point. We can set up a dedicated web resolver
for our application and just grab the LSID metadata via a simple HTTP GET. This will show people that not having a full LSID client stack
is not a show stopper.

- Next, we invoke the triple scrapers (2) in some order. Just as each URL had its own Atom Entry, the set of triples generated by each
scraper will be put into an Atom Entry along with additional metadata as above. To do this, we'll make use of an RDF library to build
up and then serialize a collection of triples.

- Here is an example of an Atom Entry created on the client. Queso will add extra Atom elements before it serializes it to
RDF. The Atom Entry is in fact stored as RDF in Boca. When requested, Queso can easily recreate the Atom Entry from the RDF triples.
The RDF triples generated from the Atom Entry, as well as the triples parsed from the content, are all stored in the same named graph. (See the discussion section.)

<entry xmlns="http://www.w3.org/2005/Atom" xmlns:xml="http://www.w3.org/XML/1998/namespace" xml:base="">
  <id>urn:qid:test.boca.adtech.ibm.com:entries:2050832554</id>
  <title type="text">APP Named Graph Entry</title>
  <updated>2007-07-09T23:02:32.734Z</updated>
  <author>
    <name>APPNamedGraphService</name>
  </author>
  <link rel="nala-page" href="http://www.klinewoods.com/" />
  <link rel="nala-src" href="http://www.klinewoods.com/foaf.rdf" />
  <content type="application/rdf+xml">
    <rdf:RDF xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:wot="http://xmlns.com/wot/0.1/">
      <foaf:Person rdf:about="urn:bszekely">
        <foaf:name>Ben Szekely</foaf:name>
        <foaf:mbox rdf:resource="mailto:bszekely@@gmail.com"/>
      </foaf:Person>
    </rdf:RDF>
  </content>
</entry>

- The id in the <id> element is initially generated by the client but, in the current implementation of Queso, is replaced by one
that the server knows is unique, though in a format that is somewhat configurable. This id will be the id of the named graph containing the
Atom triples and content triples. We'll have a more detailed discussion of ids in the discussion section, in particular the naming
of graphs and where we can use LSIDs.

- The two link elements specify where the RDF comes from. The first, "nala-page", is the page the triples or links were scraped from.
The second, "nala-src", is the source the RDF came from in the case of (1) link scrapers. The URI of a triple scraper probably doesn't
belong in a link element; we'll most likely have to design a special RDF triple or Atom extension for this.

- We will add specific titles to the links as well as the entry, which we will use to display them in lists and search results.

- In the Atom Publishing Protocol (APP), entries must always be posted to a collection, specified in the POST URL. For example:

http://localhost:8080/atom/urn:qid:nala.gbif.org:collections:namedgraphs

would post to a collection with URI urn:qid:nala.gbif.org:collections:namedgraphs. Internally, the collection is a named graph
that contains links to all the entries. In addition, however, it is a mechanism for providing Atom Feeds. In our application
it probably makes sense to have a collection per page. This will allow users to track metadata collected about a page without
necessarily having to revisit that page themselves. We'll explore this idea more later. (A sketch of posting an entry to a collection follows.)

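As a minimal sketch of the client side of this POST: the collection URL layout matches the example above, while the buildAtomEntry helper is hypothetical shorthand for wrapping the scraped RDF in an Atom <entry> like the one shown earlier.

  // POST a scraped entry to a Queso APP collection with XMLHttpRequest.
  function postEntryToCollection(collectionUri, rdfXml, pageUrl, srcUrl, callback) {
    // buildAtomEntry is a hypothetical helper that wraps the RDF payload in an
    // Atom entry and adds the nala-page/nala-src links and titles.
    var entryXml = buildAtomEntry(rdfXml, pageUrl, srcUrl);
    var xhr = new XMLHttpRequest();
    xhr.open("POST", "http://localhost:8080/atom/" + collectionUri, true);
    xhr.setRequestHeader("Content-Type", "application/atom+xml");
    xhr.onreadystatechange = function () {
      if (xhr.readyState == 4) {
        // Queso responds with the beefed-up entry; hand it back to the caller.
        callback(xhr.status, xhr.responseText);
      }
    };
    xhr.send(entryXml);
  }
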
- For the version of the application completed in Phase 2 of the contract, it probably makes sense for all named graphs to be publicly accessible
for ease of demonstration and debugging. In the discussion section below, we will discuss various security options and how they can
be implemented atop Boca's security model. Note that having an open security model does not prevent users from having working sets stored
on the server and any other per-user information. It just means that this information will be open as well. If Ricardo and Lee feel that
security is a priority for the first cut, we can make it so.

- OK, so we have now posted one or more Atom Entries to Queso containing RDF scraped directly from the page or from links
contained in the page. Queso, as a response to the POST, will send back a beefed-up Atom Entry with additional link elements and
date elements. The RDF content will also contain the transformed Atom Entry in RDF. This may seem a bit weird, but the content of the
named graph now includes those triples.

- At this point in the application, the page has been quietly scraped for RDF, and everything has been uploaded to Queso, all without
the user having to do anything. We will then alert the user, via an icon in the status bar or toolbar, that RDF has been discovered.
Now we have a temporary Client Repository with all the various collections of triples from links and scrapers. As we scrape and POST, we can
easily maintain such a repository: each resource, with its properties and relationships, as well as metadata such as
keywords, the source page, the named graph(s), the date it was fetched/created, etc.

Once all the scraping, fetching and posting has been completed, we can present the user with a table of all triples, partitioned (maybe with tabs)
by named graph URI, and organized according to the Table View defined in the requirements list. Many approaches exist for presentation here,
but the bottom line is that we will give the user the chance to save all or part of the discovered RDF. The user can tag the saved named
graphs with keywords, either individually or as a whole.

- Once the user has selected the named graphs to save, and has ascribed optional keywords to them, we will call a special service that will
receive this information and add the named graphs, by URI, to the user's collection in Boca. The service will also add the keywords to the index.
The service will be a simple HTTP POST that contains a plain list of named graphs and tags:

ng1 tag1 tag2 tag3
ng2 tag4 tag5 tag6

The request will be received by an HTTP servlet that will use the Boca named graph API to

1) add the named graph to the user's collection of saved graphs
2) add the keywords to the keyword named graph, global for all users

The graph in 2 will be the authoritative and permanent store for keywords, but will not do for fast lookups (see below). A sketch of the
client side of this save/tag request follows.

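A minimal sketch, assuming the plain "nguri tag1 tag2 ..." line format above; the /nala/save path is a hypothetical endpoint name, not something Queso provides.

  // Send the user's selected graphs and tags to the save/tag service.
  function saveGraphs(taggedGraphs, callback) {
    var lines = [];
    for (var i = 0; i < taggedGraphs.length; i++) {
      lines.push(taggedGraphs[i].nguri + " " + taggedGraphs[i].tags.join(" "));
    }
    var xhr = new XMLHttpRequest();
    xhr.open("POST", "http://localhost:8080/nala/save", true);
    xhr.setRequestHeader("Content-Type", "text/plain");
    xhr.onreadystatechange = function () {
      if (xhr.readyState == 4) { callback(xhr.status); }
    };
    xhr.send(lines.join("\n"));
  }
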
- Now, we could always maintain a map of ng URI -> triples on the client to facilitate querying later on, but it isn't clear how useful this cache would be.

Section 1.2 - Data retrieval techniques

Before proceeding further, it will be useful to discuss the various approaches we will use for getting data out of the system. These
will be used heavily by the functionality described in the sections that follow.

- APP Named Graph requests
We can easily retrieve the contents of a particular named graph in RDF/XML. In many cases we will use this for showing
simple Table or Single views. However, in general, it is difficult to manipulate RDF on the client as there is limited query
support and performance is not great. One advantage of this data access method is that the APP endpoint employs some sophisticated
caching mechanisms.

- SPARQL queries
This will be our prevailing approach to fetching data from the server. SPARQL queries can be issued against a single named graph,
a collection of named graphs, or all named graphs together. The results can be returned in a number of formats, the most convenient
of which, for our purposes, is JavaScript Object Notation, or JSON. JSON can be directly evaluated into JS objects and easily traversed
to display the information. In addition, libraries such as Exhibit and Dojo make the display of JSON data quite easy. (A sketch of issuing
such a query from the client appears below.)

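A minimal sketch of the client side, assuming a SPARQL endpoint at /sparql that takes a query parameter and can return SPARQL JSON results; the exact path and parameter names are assumptions about the Queso deployment, not documented behavior.

  // Issue a SPARQL query and hand the JSON result bindings to a callback.
  function sparqlQuery(query, callback) {
    var xhr = new XMLHttpRequest();
    var url = "http://localhost:8080/sparql?output=json&query=" + encodeURIComponent(query);
    xhr.open("GET", url, true);
    xhr.onreadystatechange = function () {
      if (xhr.readyState == 4 && xhr.status == 200) {
        // SPARQL JSON results look like { head: { vars: [...] }, results: { bindings: [...] } };
        // as noted above, the JSON can be evaluated directly into JS objects.
        var results = eval("(" + xhr.responseText + ")");
        callback(results.results.bindings);
      }
    };
    xhr.send(null);
  }

  // Example: list a few named graphs in the store.
  sparqlQuery(
    "SELECT ?g WHERE { GRAPH ?g { ?s ?p ?o } } LIMIT 10",
    function (bindings) {
      for (var i = 0; i < bindings.length; i++) {
        // each binding value is an object with a .value property
        // (dump is available in a Firefox extension context)
        dump(bindings[i].g.value + "\n");
      }
    });
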
Section 1.3 - Working Set management

** After thinking it over, we believe that the working set was intended to be what we refer to below as the Client Repository,
something that we are building up over searches and queries in the memory of the browser, not a specific set of named graphs that needs to be maintained over time.
We leave the design of the Working Set Manager here in case we want to implement it later, but I believe that it is not necessary after
further examination of the initial requirements. **

** We will need a similar system that can be used to manage saved named graphs. We will retain the parts of this design that
pertain only to saved named graphs. **

In this section we discuss how the user manages his Working Set. First we discuss the difference between his Working Set
of named graphs and his Saved Set of named graphs. The Saved Set is simply the set of graphs that the user wishes
to remain in the system and not get deleted during maintenance (see below). The Working Set is the set of graphs the user
is particularly interested in at the moment, i.e. the ones being used to solve a particular task. In fact, it may make sense
to allow multiple working sets of graphs.

- From the toolbar or another sensible location, the user selects "Manage Working Sets".

- The Working Set Manager (WSM) will be an HTML page inside the plugin that has several column views.

- The leftmost view will be a list of all Working Set names. The data will be retrieved by SPARQL.
The user will be able to add or delete working sets from this column. These operations will immediately
go to the server and perform the change.

- When a working set is selected in the left column, each named graph in the set will be displayed in the next column
over, again using SPARQL. Instead of showing the URI, we will display a human-readable title from the Atom Entry. The SPARQL
query can easily pick out this title from the named graph. The user can select a group of graphs to "unsave" if he
deems that he will never be interested in them again.

- The third column will contain a list of named graphs in the user's saved set that are candidates for the working set. When we
fetch these from the server using SPARQL, we can join with the keyword index and bring down all keywords for each named graph
in the saved set. The user will then very quickly be able to refine this list via incremental keyword search.

- Once the user is satisfied with the state of the current working set, he can push the changes to the server using a special
request similar to the one that occurs when the user saves browsed RDF.

- The user can also choose to export the data to tab-separated or CSV format. It is unclear whether that should be a function
of the WSM or a function of the table view, but either way, some sort of shortcut will be provided here.

- Now, it might be tempting here to add additional functionality to the WSM, such as finding new named graphs to add to the saved set
and then maybe the working set, but this would be overloading the WSM and encroaching on the jobs of the other bits of the
application.

Section 1.4 - Data query and viewing

The details of how data is searched, queried, browsed and viewed will most likely evolve as the application is developed. The design
and functionality described here is merely our starting approach and first attempt to meet the outlined requirements.

- The views of the data fundamentally provide a view of many resources and their properties and relationships. Internally, the data is
stored in the Client Repository as Javascript objects: a resource, with its properties and relationships, as well as metadata such as
keywords, the source page, the named graph(s), the date it was fetched/created, etc.

All views - Tree, Table, Map and Link - are fundamentally views of this core data structure.

** After further evaluation, we feel that the Client Repository fulfills the requirements of the Current Set better than the
WSM system designed in 1.3. **

Tree View - The top level of the tree represents all facets on which we want to browse the data. For now, these are hard-coded
facets such as source page, scraper, date, keywords. When the tree view loads, it builds itself at runtime based on the
current state of the Client Repository. When a branch is opened, such as keywords, the next level allows us to show
all resources found so far, and the number of them, or to further restrict the set as at the root. This continues until the user is
satisfied with the restriction. When a resource is selected, we
can move to the single view, which shows all the available information about the resource as well as where it came from.
Alternatively, the whole set of resources can be selected and shown in table view. This whole process is equivalent
to showing a Table view refined using the search bar at the top with keywords, source, etc.

Table View - A table view shows a table of resources, one row per resource. Each column represents a property. If a resource has
no value for that property then the cell is blank. At the top of the table will be a method for the user to
refine the contents of the table by keyword, source, date, or other metadata. He can also add or remove columns from
the table.

Link View - All resources in the current Client Repository will be displayed as nodes in a graph, with labeled edges between them.

Map View - All resources in the current Client Repository that have geographic information will be displayed on a map. This view must
be extensible so that different types of geographical information can be extracted, e.g. different ontologies for long/lat coords,
city-state-country representations, etc.

- The Client Repository, a fancy name for the Javascript object structure, must be carefully designed to allow what we want.

- While the data on the server is stored by named graph so that data provenance is fully maintained, on the client the requirements of the application
dictate that the Resource, i.e. the subject of each triple, is king. So the top-level object will be an associative array keyed by
Resource URI.

- Now we have a decision to make: do we want keywords, named graph, source and other metadata associated with each triple or with
each resource? The former is clearly much more expensive, but the latter means the user knows only the set of keywords associated
with a resource, or the set of named graphs and web pages that the resource is associated with, not each statement. We'll go with the relaxed
approach, assuming that it will be sufficient. After all, the complete history of every statement and named graph lies in Boca and can be
specifically queried with SPARQL if necessary. Thus the lists of keywords, dates (we'll do some clever aggregation here), named graphs, source
pages, etc. will each get an array hanging off the main object.

- Next, we'll have an array of statements. There are a number of ways to do this. One approach will be to have an associative array
whose key is the full URI of the predicate and whose value is an array of the objects. We'll start with this approach and modify
it as we go forward. (A sketch of the resulting structure follows.)

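As a rough sketch of what one CR entry might look like under this approach, using values from the Atom Entry example in Section 1.1; the property names (keywords, sources, namedGraphs, fetched, statements) are illustrative assumptions, not a fixed schema.

  // Hypothetical shape of the Client Repository: an associative array keyed by
  // resource URI; metadata arrays hang off each entry, and statements map a
  // predicate URI to an array of object values.
  var clientRepository = {
    "urn:bszekely": {
      keywords:    ["parrot", "foaf"],                 // illustrative tags
      sources:     ["http://www.klinewoods.com/"],
      namedGraphs: ["urn:qid:test.boca.adtech.ibm.com:entries:2050832554"],
      fetched:     ["2007-07-09T23:02:32.734Z"],
      statements: {
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": ["http://xmlns.com/foaf/0.1/Person"],
        "http://xmlns.com/foaf/0.1/name": ["Ben Szekely"]
      }
    }
  };
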
- The CR is a convenient lowest-common-denominator approach to providing a data representation to each of our views, as well as a way to temporarily display search
results and captured RDF.

- Populating the client repository

- Sometimes we'll use a CR data structure for one-off jobs, such as during data collection when we want to present to the user a table view of all the data
scraped from the current page. In this case, we'll just build a client repository, view it, and then destroy it when we are done
(though not before possibly caching the data in the main CR).
Other times, we'll want to keep the client repository around during a user's session, maybe even serialize it to disk to be retrieved
when they come back online. This will be a bonus feature if there is time. However, this use of the CR is what the requirements list
refers to as the Current Set.

- The first technique will be from a complete named graph in RDF/XML via APP. Here we'll use the 3rd party RDF parser to get all the triples,
then iterate through and put them in our store, taking note of keywords etc. as they are available, if at all.

- The second technique is through SPARQL queries. Here, depending on what the user supplies, we may not get keywords or other metadata
unless we either query for it afterwards or force the user to query for it. We will write a very simple routine that takes SPARQL results
in JSON and converts them to CR format. We can then offer the user the option of filling in the missing provenance information through
extra queries which we will generate automatically for them. (A sketch of the JSON-to-CR conversion follows.)

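A minimal sketch of that routine, assuming the query projects ?s ?p ?o variables and the CR shape sketched earlier; both of these are assumptions made for illustration.

  // Convert SPARQL JSON bindings (assumed vars: s, p, o) into the CR structure.
  function bindingsToCR(bindings, cr) {
    for (var i = 0; i < bindings.length; i++) {
      var s = bindings[i].s.value;
      var p = bindings[i].p.value;
      var o = bindings[i].o.value;
      if (!cr[s]) {
        cr[s] = { keywords: [], sources: [], namedGraphs: [], statements: {} };
      }
      if (!cr[s].statements[p]) {
        cr[s].statements[p] = [];
      }
      cr[s].statements[p].push(o);
    }
    return cr;
  }
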
- Search and getting new data

So now the user has just browsed a bunch of pages and has quite a bit of RDF saved in Boca. Depending on how we implement it,
they may already have some cached data in the main CR as well. There are a number of ways for the Scientist in our scenario to do his search.
No matter what the search is, the scientist may choose whether he is searching the whole datastore or only his saved data.

The search area will be an HTML page that has different tabs or modes that allow for the different types of searching.

Once a search has been completed, the user might not want all statements immediately placed in the CR. In practice, we'll most likely
build up a temporary CR (as in data collection), show it to the user via Table View and let them select which resources go into the main CR. This
process will happen exactly once per search. The selection process can involve keyword or other search on the table view of the
temp CR.

After this final step, the user can do another search, or select a view of the main CR to browse the data, and possibly export.

- Keyword Search: The global keyword index on the server can very quickly return all the named graphs for a particular keyword. Depending
on the scope restriction, we'll have to do a quick SPARQL query on the server to remove the unnecessary graphs, and then return them to the
client with all the information needed to populate a temporary CR. If the user issues multiple keywords, we'll let the user
specify whether it is an AND or an OR, but nothing complex like "k1 AND (k2 OR k3)". One approach that we'll try initially here is that
the keyword search will just return the named graphs, and then the client will issue APP requests for the RDF itself. This will
enable us to leverage the caching functionality of Queso.

- Text search: Boca has a built-in Lucene-based text indexer, but its performance and reliability are questionable. It works just like the keyword index
we'll build: you give it a search string, and it returns matching named graphs. The indexer indexes all literal values. If we
have time, we'll try to leverage this search feature for the application.

- SPARQL Search: Using the SPARQL endpoint, we can issue queries from the client to the server, getting the results back in JSON. Depending
on the query the user issues, we may be able to issue additional auto-generated queries to fill in keywords, source pages, scraper data,
etc. The user may not need this information anyhow. After the SPARQL query has been issued, we'll transform the JSON and insert the
resources and statements into the temporary CR. We may have to restrict the type of SPARQL queries we allow the user to issue in order
that they fit nicely into the CR data model. Alternatively, we can allow a loose SPARQL query and, when the data comes back, allow
the user to perform a Resource search or further explore the data. We will refine this design as we move forward with implementation.

- Resource search: if a user encounters a particularly useful resource, he can issue a request to the server to find all
named graphs that mention that resource. This is just a special case of a SPARQL query, but it deserves special treatment in the UI.
The results of this search will be shown in a single view. (An example of the underlying query is sketched below.)

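For illustration, assuming the sparqlQuery helper sketched in Section 1.2, the resource search could boil down to a query like the following; the exact query shape is an assumption, not settled design.

  // Find every named graph that mentions the given resource, as subject or object.
  function resourceSearch(resourceUri, callback) {
    var query =
      "SELECT DISTINCT ?g WHERE { " +
      "  { GRAPH ?g { <" + resourceUri + "> ?p ?o } } UNION " +
      "  { GRAPH ?g { ?s ?p <" + resourceUri + "> } } }";
    sparqlQuery(query, callback);
  }
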
* New named graphs encountered via search may be added to the user's list of saved graphs.


Section 1.5 - Scraper registration and storage

As discussed in 1.1, we will have two different types of scrapers: those that search for links to RDF, and those that extract RDF triples
themselves. There are numerous security risks involved in allowing people to register JS code that can run. However, there are ways
to properly sandbox the code, which we will make use of.

- Each scraper implementation can depend on certain objects that it can access in order to do its work. (A sketch of a possible scraper
interface follows this list.)

- The code for each scraper will be stored as an Atom Entry in Queso/Boca. When the application loads, we can easily
find all scrapers using SPARQL queries.

- Scrapers can have metadata associated with them that helps determine which websites they should be applied to.

- We'll have a simple page for registering scrapers and loading them into Queso.

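Purely as an illustration of the shape such registered code might take (the property names and the document argument are assumptions, not a settled interface), a link scraper that picks up standard FOAF autodiscovery links could look something like:

  // Hypothetical scraper interface: metadata used to decide where the scraper
  // applies, plus a scrape function that returns RDF URLs (type 1) or triples (type 2).
  var foafLinkScraper = {
    id: "foaf-link-scraper",
    type: "link",                         // "link" (1) or "triple" (2)
    appliesTo: /.*/,                      // URL pattern this scraper should run on
    scrape: function (doc) {
      // Return the href of every <link rel="meta" type="application/rdf+xml"> element.
      var urls = [];
      var links = doc.getElementsByTagName("link");
      for (var i = 0; i < links.length; i++) {
        if (links[i].rel == "meta" && links[i].type == "application/rdf+xml") {
          urls.push(links[i].href);
        }
      }
      return urls;
    }
  };
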
Section 1.6 - Data export

Data export should be a straightforward operation atop the Client Repository. In the first cut of the application we will implement
export atop the Table View. The user will be allowed to select all or a subset of the rows in the table and choose to export them in one
of the available formats. We will allow comma-separated values (CSV) and tab-separated values (TSV) to start with. The user can also
specify whether they want all predicates exported or just the selected ones. In general, exporting all predicates will yield a very sparse
export.

- Given an array of resources and predicates, we can easily generate the proper output, as sketched below.

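A minimal sketch of that generation step, assuming the CR shape sketched in Section 1.4; the helper name and the simple separator handling are illustrative only (a real CSV export would need quoting).

  // Export selected resources/predicates from the CR as CSV or TSV.
  // Cells with no value are left blank, which is why "all predicates" is sparse.
  function exportRows(cr, resourceUris, predicateUris, separator) {
    var rows = [["resource"].concat(predicateUris).join(separator)];
    for (var i = 0; i < resourceUris.length; i++) {
      var entry = cr[resourceUris[i]];
      var cells = [resourceUris[i]];
      for (var j = 0; j < predicateUris.length; j++) {
        var values = entry && entry.statements[predicateUris[j]];
        cells.push(values ? values.join(" ") : "");
      }
      rows.push(cells.join(separator));
    }
    return rows.join("\n");
  }

  // Usage: exportRows(clientRepository, ["urn:bszekely"], ["http://xmlns.com/foaf/0.1/name"], ",");
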
Section 1.7 - Keyword-named graph indexing

In data storage systems where full text search is difficult to implement or not performant, keyword search is often employed
to provide a quick alternative. To support this in our application, we will build keyword indexing as an auxiliary service atop Queso
and Boca, much like the saved-graphs service defined in Section 1.1. There is a slight difference in that the keyword indexing service
might be something that belongs in Boca/Queso and is not specific to our application.

- Keywords will be permanently stored in a global named graph such as urn:qid:nala.gbif.org:system:keywords. We don't spread the keywords
over multiple named graphs because of the overhead of reading through all of them while building indexes. We have a couple of different
choices for how this named graph is arranged:

(1) Flat: a simple flat list of triples of the form <nguri> <queso:keyword> "keyword1"
(2) Per NG: <nguri> <queso:keywords> <_bnode:1>
            <_bnode:1> <queso:keyword> "keyword1"
            <_bnode:1> <queso:keyword> "keyword2"


(1) allows us to issue the query for "all NGs for a keyword" or "all keywords for a named graph" without
performing any joins. Indexes should help with both of these queries.
With (2), queries for all NGs for a keyword will be expensive.

Also, we will not be querying this graph directly when a user issues a keyword search. This graph will be used mostly to
build indexes.

We proceed with choice (1).

- It's very easy to build storage-efficient in-memory indexes for keyword searches. We'll actually take a two-index approach.

Type-ahead search index (TASI): As the user types keywords, we want to help auto-complete them. To facilitate this, we simply keep
a sorted array of all the keywords on the server and, as the user types a character, we send the word typed so far (prefix) to the
server via Ajax. We then do two binary searches on the array to find the range of keywords that match the prefix and return them.

Named-graph index (NGI): Once the user has chosen the keywords, we want to be able to quickly find the corresponding named graphs. The reason
for this is that SPARQL queries across the whole database are very slow, so if we can a priori narrow down the set of named graphs
for the query, we can reduce the scope and much more quickly return the content from those graphs.

The difficult bit here is how to update these indexes as more named graphs are tagged with keywords. It basically boils down
to inserting values into a sorted array, which is O(n log n) if we simply re-sort. If the number of keywords isn't huge, this shouldn't be a problem. For the demo
we'll simply add the new values to the end of the array and call Arrays.sort(..) and see what happens. In practice, we'll probably want
to copy the array before insertion so we don't foul up queries that happen during the sort. (The prefix lookup itself is sketched below.)


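The prefix lookup is simple enough to sketch. It is written in Javascript here for brevity, though the real index would live in the Java servlet; the helper names are illustrative.

  // Given a sorted array of keywords, return all keywords starting with prefix,
  // using two binary searches to find the matching range.
  function lowerBound(sorted, value) {
    var lo = 0, hi = sorted.length;
    while (lo < hi) {
      var mid = (lo + hi) >> 1;
      if (sorted[mid] < value) { lo = mid + 1; } else { hi = mid; }
    }
    return lo;
  }

  function prefixRange(sortedKeywords, prefix) {
    var start = lowerBound(sortedKeywords, prefix);
    // "\uffff" sorts after any continuation of the prefix, giving the end of the range.
    var end = lowerBound(sortedKeywords, prefix + "\uffff");
    return sortedKeywords.slice(start, end);
  }

  // prefixRange(["apple", "parrot", "parse", "queso"], "par") -> ["parrot", "parse"]
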
- To summarize, the steps of keyword search are as follows:

- In the search bar, the user types multiple words. As the user types each word, the application queries the TASI, helping the
user discover keywords that will yield any results at all.

- The application then sends the keywords to Queso, which uses the NGI to find all the named graphs for each keyword and computes
the union or intersection depending on AND or OR. It then returns just this list of NGs to the client.

- The application then issues named graph requests over APP for each of the named graphs, parses the response using the 3rd party
parser, and inserts the triples into the temporary CR for review by the user.


Section 2 - Extended Discussion

In this section we discuss some design and technical issues that didn't make sense to delve into in Section 1. Some of the discussion here
raises issues that we will not necessarily deal with in the demo version of the application, but that should at least be brought up to be
addressed in future versions. Most of these issues come up in various places in the application and affect the design of the system all the way
through the stack.

Section 2.1 - Caching

Caching will be very important to the scalability and performance of our application. Data may be cached both in memory on the Web server and
in the browser extension. Some caching is provided for us by existing components and some caching occurs inherently by the behavior of the system.
At this point, we do not see the need to build additional caching mechanisms.

- client-side caching: The Client Repository is our client-side cache. All of the views in the application operate on this local data
instead of going to the server for each view. This inherent caching mechanism is crucial to performance.

- server-side caching:

- Queso cache: All requests to Queso/Boca via the Atom Publishing Protocol (APP) are automatically subject to caching. Since much of the
data will be brought down to the application via APP, we will benefit automatically from its caching mechanism. An interesting property
of our system is that named graphs are write-once, read-only thereafter. This is not a restriction of Queso or Boca, but merely a consequence of
our application: every bit of scraped data gets its own named graph. The consequence is that the data can be cached indefinitely.

- keyword search: This is in fact just a cache of the special-case query for keywords.

Section 2.2 - Named graphs, duplicate data and ids

Every application that makes use of a named-graph based triple store must decide how to partition the data across graphs. In our application,
each set of triples accumulated by a particular scraper, or fetched from an RDF link, receives its own named graph. This makes updating
the data store very simple: find some triples, create a named graph for them. It is not without its problems, however. Right now, if a page
is scraped and saved repeatedly, we'll have duplicate data for the page in the system, possibly within the saved graphs of a single user. Because
we store all the provenance information for each graph (the URL, the scraper used, the date, etc.), we can use SPARQL ASK queries to determine if
a page should be rescraped. These queries can be fairly expensive because they must be posed against the entire datastore. We now discuss a few
alternative named graph data models, and why they were not selected.

One alternative to this approach was to have each resource get its own named graph. Then adding duplicate data would be a no-op. The problem
here is that we lose all provenance about where data came from. In particular, if data about a particular resource is conflicting, sticking it
all in one graph could create false information. Solving this knowledge consistency problem is beyond the scope of the application.

Another approach would be to have a named graph per web page. This works OK for static web pages. As the web page changes, we would eventually
rescrape it and then we'd just get a new version of the named graph. However, data from different scrapers may create conflicting triples. This
information needs to be separated out. The picture gets even more complicated when we have dynamic web pages, or Javascript-based web pages.

When all is said and done, the safest approach is just to have each self-contained piece of RDF reside in its own named graph. We can handle
duplicate data by not rescraping when we don't have to, and by handling duplicates at the CR level on the client when we display the information
to the user. If triple explosion becomes a problem, we can re-evaluate our approach or engineer new ways to avoid duplicate graphs.

A somewhat convenient corollary to this is that the URIs for named graphs can be anything we like. We can simply allow the Queso system
to generate named graph URIs for us, and the application will just treat them as opaque.

Section 2.3 - Security

When we speak about security, we don't mean making the web server hack-proof or worrying about malicious scripts running in the browser. These
are all important issues, but not of immediate concern. What we really intend to discuss here is the security model for our application.
As a demo application, it's OK for all named graphs and keywords to be public information. However, in a real deployment this might not
be the case. If people are browsing private websites, they might not want to upload private RDF into a public database. Furthermore, keywords
might themselves be private.

One possible security model would be that each named graph in the system is created under the user of the application. That user can then, through
Boca/Queso management tools, decide to give permission to other people to see their graphs. The problem with this is that under normal
operation (which is what 99% of users do), most data would be partitioned by user, reducing the collaborative benefit of the system.

Section 2.4 - Maintenance

Because we retain all RDF that we browse, things could potentially start to get out of hand quickly. We would really like to have a mechanism
for removing data that isn't saved after some period of time. Unfortunately, Boca does not have a way to delete named graphs. We can clear the
contents of a named graph, but the triples will still exist in the history tables. We will leave our design as is for now, but we may need to
either modify Boca (not preferable) or revise our design so that only the explicitly saved named graphs are stored in the database.

Section 2.5 - Inference and cross-mapping types

Inference is one of those tricky features of a Semantic Web system that is difficult to get working right. We see two basic approaches to
inference in our system. First, we can rely on Boca to infer triples based on owl:sameAs. I believe it does this, but my buddy who implemented
that is away now. When he gets back I can ask him how it works. I believe it should provide the rudimentary behavior outlined in the requirements.

The second approach to inference would be to do something simple in the Client Repository to entail extra triples in the views. The benefit of this
approach is that the rules could be applied per user. The rules themselves can be stored as Atom Entries in Queso, and thus shared, but application
of the rules would be per user, as is *not* the case in Boca.

Section 2.6 - LSIDs

Thus far we have identified one place where we can make use of LSIDs in the application. We can pick out LSIDs when we are scraping and
attempt to resolve their metadata using a web resolver.

As we pointed out earlier, named graphs, as they are accessed by the application, are write-once, read-only entities. Thus, we could use LSIDs
as named graph URIs in our system, and add an LSID resolver atop Boca/Queso. The main challenge would be to figure out which extra triples to return
to present a meaningful web of LSIDs to browse.

Section 3 - Technical components and libraries

Here is a condensed list of some of the required technical components. It's more of a summary of the previous sections.

- RDF Store and REST API
  - Boca
  - Queso

- Atom Javascript library
  - generate and parse valid Atom entries
  - Atom Publishing Protocol client library

- Javascript RDF library
  - to parse RDF/XML and collect pure RDF we'll use the 3rd party library available at
    http://www.jibbering.com/rdf-parser/
  - RDF/XML serializer

- Client Repository
  - load from RDF triples
  - load from JSON

- SPARQL library to issue queries against Queso/Boca
  - sparql.js

- Data export mechanism

- Query interface
  - keyword
  - sparql
  - resource query (in views)

- Views
  - table
  - tree
  - single

- REST services
  - graph save/keyword tagging service
  - keyword type-ahead service
  - keyword search service

Section 4 - Development plan

Of the remaining 5.5 weeks, or 220 work hours in the contract, I will be working full time for about 3 weeks, and then part time to finish it off.
I'm still splitting the work into 40-hour increments, though they may not fall on boundaries of actual weeks.
My goal is to submit the application for final review by the end of August. The timetable below is a rough schedule for completing the various parts
of the application. The purpose is really to let me know if I'm spending too much time on one thing, but I'm not going to view the various
milestones as hard deadlines. I believe the overall goals of the project are quite ambitious given the time frame, but I think they are
doable. If time becomes an issue, I will alter the goals so that at least a basic prototype allowing us to demonstrate the usage scenario is
ready.

Week 1 (hours 0-40)

- Data collection and screen scraping completed
  - basic web page scraping
  - follow RDF and LSID links
  - scrapers stored and registered in Queso
  - simple UI for selecting scrapers and adding new ones

- CR design finalization and implementation
  - load data from parsed RDF/XML

Week 2 (hours 40-80)

- notify user that data has been scraped
- implement the table and single views to let the user browse scraped RDF
- augment the table view so that the user can save/tag graphs in view
- implement the service that receives saved NGs and keywords and
  inserts them into the proper named graphs
- implement the keyword search services
  - type-ahead keyword service
  - NG retrieval service

Week 3 (hours 80-120)

- design and implement the overall search framework and UI
- get keyword search working
  - call the keyword type-ahead search
  - call the NG index service
  - retrieve all the NGs over APP
  - load them into the temporary CR
  - display in table view
  - allow user to include various rows in main CR
- get SPARQL search working
  - figure out how to refine the search or leverage
    a secondary Resource search to get statements in JSON
  - implement the JSON -> CR transform
  - show the results to the user, and allow them to add to main CR

Week 4 (hours 120-160)

- implement overall main viewing framework with tabs, etc.
- make sure table view works on main CR
- view the main CR, and explore using keywords etc.
- select a resource and show in single view
- table view
- tree view
  - arrange resources in a tree
  - each child of a node represents a different axis of refinement
  - get a basic example working with a couple of different axes of refinement

Week 5 (hours 160-200)

- Data export
- Saved graph management
- named graph maintenance
- inference

Week 5.5 (hours 200-220)

- finish unfinished tasks
- Extra scrapers
  - RDFa scraper
  - GRDDL scraper
  - NCBI scraper
- Extra views
  - link view
  - map view
- full text indexing in Boca


@