wiki-archive/twiki/data/GUID/NalaDemo.txt,v

head 1.2;
access;
symbols;
locks; strict;
comment @# @;
expand @o@;
1.2
date 2007.07.14.13.56.46; author BenSzekely; state Exp;
branches;
next 1.1;
1.1
date 2007.07.14.00.13.13; author BenSzekely; state Exp;
branches;
next ;
desc
@none
@
1.2
log
@none
@
text
@%META:TOPICINFO{author="BenSzekely" date="1184421406" format="1.1" reprev="1.2" version="1.2"}%
%META:TOPICPARENT{name="WebHome"}%
For the Nala Demo design project see NalaDesign
@
1.1
log
@none
@
text
@d1 1
a1 1
%META:TOPICINFO{author="BenSzekely" date="1184371993" format="1.1" version="1.1"}%
d4 1
a4 647
In Section 0 we begin with a preliminary discussion of the components we will use. We also discuss why we are not using alternatives such as
Piggy Bank. In Section 1 we give a basic description of how the system will work, end-to-end, with architecture diagrams included. In this section, we will
detail the design decisions and implementation techniques as much as possible. However, to avoid disrupting the flow, we will include more
complete discussion of certain decisions and systems in Section 2. In Section 3, we include technical descriptions of the necessary
components. In Section 4 we describe the development plan along with rough implementation estimates.
Section 0 - Preliminaries
For lack of a better name, I initially call the system 'Nala' after the baby South American Caique parrot that is
living in my study for a couple of weeks. She can't talk yet but is really good at debugging JavaScript. (She just pulls
the keys off my Thinkpad keyboard with incredible speed if I turn my head for a second.)
'Queso' refers to the web application framework running atop 'Boca', the RDF database. The main features we will
use in Queso are the Atom Publishing Protocol and SPARQL endpoints.
Before delving into the system design, we briefly discuss some existing tools and why we have decided to use or not use them.
Piggy Bank
On the surface it appears that Piggy Bank does almost everything we need of our application. It is a sophisticated system
that can detect RDF in pages via links and scraping. It stores data locally in Sesame, but can share data in remote semantic banks.
Both the local and remote semantic banks can be queried and browsed using faceted browsing and query. It uses the Longwell faceted
browser and, from what we can tell, it is heavily intertwined with the Longwell framework.
Pros
- nice faceted browsing and search
- provides many of the required features of the project as is
- map/timeline
- is a Firefox plugin
- extensible screen scraping framework
- has some form of remote storage
- it could almost be used as is.
Cons
- large, complicated, not very well documented code base
- backend store is not documented or easily accessed via API
- no built-in extensibility points for views or backends
- though it might be passable as is, we would only get it 'as-is',
as extending and hooking into the large codebase might be
too difficult in the time of the contract
- a large part of the savings in using Piggy Bank would be in any UI
work they have done, but the UI does not satisfy the requirements of the
application as is.
- we'd have to modify code in the backend (Java), increasing the dev cost
by a lot
- even if we could hook into their backend, or modify their front end
to use our backend, maintaining this code would be difficult.
- It does not appear that development on Piggy Bank is continuing aside
from small bug fixes
Queso, Boca
Boca is an open source RDF store developed by my former team at IBM. It has such enterprise features as revision tracking, access control,
transactions, SPARQL query and a named graph data model and API. Atop Boca runs Queso, a Semantic Web application development platform allowing
simple Ajax/REST access to Boca named graphs and the SPARQL engine from JavaScript applications. Using these systems, we can build our
application rapidly because we won't have to write any server-side code, and I'm quite familiar with the codebase should any
bug fixes or new features be required. Another great advantage of using Boca is that it has a very sophisticated programming model and
Web Service based access API. If our application takes off, and people are using it to generate large amounts of RDF, it will be much
easier to leverage that data using Boca.
Client Repository/Views
By not using Piggy Bank, we are on our own for implementing a client-side repository. The Client Repository, or CR, will be our very
simple JavaScript-based local RDF store. At first, this store will reside in memory, but eventually we should be able to serialize it
to disk using JSON. We are also left to build all our own views. However, we think that with a sufficiently simple and useful abstraction
in the CR we can build out the views defined in the requirements section. Graphically the views may not be incredibly sophisticated
but they should get the job done.
Section 1 - System Design
Section 1.1 - Data collection
In this section we describe in detail how RDF data finds its way into the system. The named graph is the atomic unit
of storage and management in our application. This works well because the named graph is also the atomic unit in Boca and Queso,
significantly simplifying our design and implementation.
- When a webpage loads, the screen scraping framework is invoked. Based on the URL and possibly other
factors, a set of screen scrapers is chosen to run on the page. The scrapers fall into two categories:
1 - code that looks for links to RDF sources
2 - code that produces triples directly from the page
We'll discuss the details of how scrapers are registered and stored further on.
- We invoke the link scrapers (1) in arbitrary order, but maintain a collective set of RDF URLs that
they find, since some scrapers might produce the same URL(s). We fetch each URL with an XMLHttpRequest.
- To save work on the client, and because the server will validate anyway, we assume the links contain valid RDF. For the RDF responseText
from each XHR we create an Atom Entry with the RDF as the content and content type application/rdf+xml, or whichever RDF content type the link
contained. (We may have to modify the Queso server to accept additional RDF content types.) The effect of posting the Atom Entry to the
Queso server is that each RDF source from the page will be automatically inserted into its own named graph. Either through adding extra metadata
through the Atom Entry XML (which will get transformed on the server) or inserting extra triples into the RDF payload, we can add additional
triples to the named graph. These might include the source URL, the scraper id, the current page that invoked the scraper and anything
else we deem necessary.
- Because we store all the provenance information for each graph (the URL, the scraper used, the date, etc.), we can use SPARQL
ASK queries to determine if a page should be rescraped.
- For the purposes of walking-the-walk, I think we should be able to handle LSIDs at this point. We can set up a dedicated web resolver
for our application and just grab the LSID metadata via a simple HTTP GET. This will show people that not having a full LSID client stack
is not a show stopper.
- Next, we invoke the triple scrapers (2) in some order. Just like each URL had its own Atom Entry, the set of triples generated from each
scraper will be put into an Atom Entry along with additional metadata as above. To do this, we'll make use of an RDF library to build
up and then serialize a collection of triples.
- Here is an example of an Atom Entry created on the client. Queso will add extra atom elements before it serializes it to
RDF. The Atom Entry is in fact stored in RDF in Boca. When requested, Queso can easily recreate the Atom Entry from the RDF triples.
The RDF triples generated from the Atom Entry as well as the triples parsed from the content are all stored in the same named graph. (see discussion)
<entry xmlns="http://www.w3.org/2005/Atom" xmlns:xml="http://www.w3.org/XML/1998/namespace" xml:base="">
<id>urn:qid:test.boca.adtech.ibm.com:entries:2050832554</id>
<title type="text">APP Named Graph Entry</title>
<updated>2007-07-09T23:02:32.734Z</updated>
<author>
<name>APPNamedGraphService</name>
</author>
<link rel="nala-page" href="http://www.klinewoods.com/" />
<link rel="nala-src" href="http://www.klinewoods.com/foaf.rdf" />
<content type="application/rdf+xml">
<rdf:RDF xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:wot="http://xmlns.com/wot/0.1/">
<foaf:Person rdf:about="urn:bszekely">
<foaf:name>Ben Szekely</foaf:name>
<foaf:mbox rdf:resource="mailto:bszekely@@gmail.com"/>
</foaf:Person>
</rdf:RDF>
</content>
</entry>
- The id in the <id> element is initially generated by the client, but in the current implementation of Queso it is replaced by one
that the server knows is unique, though in a format that is somewhat configurable. This id will be the id of the named graph containing the
Atom triples and content triples. We'll have a more detailed discussion of ids in the discussion section, in particular the names
of graphs and where we can use LSIDs.
- The two link elements specify where the RDF comes from. The first, "nala-page", is the page the triples or links were scraped from.
The second, "nala-src", is the source the RDF came from in the case of (1) link scrapers. The URI of a triple scraper probably doesn't
belong in a link element. We'll most likely have to design a special RDF triple or Atom extension for this.
- We will add specific titles to the links as well as the entry, which we will use to display them in lists and search results.
- In the Atom Publishing Protocol (APP), entries must always be posted to a collection, specified in the POST URL. For example:
http://localhost:8080/atom/urn:qid:nala.gbif.org:collections:namedgraphs
would post to a collection with URI urn:qid:nala.gbif.org:collections:namedgraphs. Internally, the collection is a named graph
that contains links to all the entries. In addition, however, it is a mechanism for providing Atom Feeds. In our application
it probably makes sense to have a collection per page. This will allow users to track metadata collected about a page without
necessarily having to revisit that page themselves. We'll explore this idea more later.
- For the version of the application completed in Phase 2 of the contract, it probably makes sense for all named graphs to be publicly accessible
for ease of demonstration and debugging. In the discussion section below, we will discuss various security options and how they can
be implemented atop Boca's security model. Note that having an open security model does not prevent users from having working sets stored
on the server and any other per-user information. It just means that this information will be open as well. If Ricardo and Lee feel that
security is a priority for the first cut, we can make it so.
- OK, so we have now posted one or more Atom Entries to Queso containing RDF scraped directly from the page or from links
contained in the page. Queso, as a response to the POST, will send back a beefed-up Atom Entry with additional link elements and
date elements. The RDF content will also contain the transformed Atom Entry in RDF. This may seem a bit weird, but the content of the
named graph now includes those triples.
- At this point in the application, the page has been quietly scraped for RDF, and everything has been uploaded to Queso, all without
the user having to do anything. We will now alert the user, via an icon in the status bar or toolbar, that RDF has been discovered.
We also have a temporary Client Repository with all the various collections of triples from links and scrapers. As we scrape and POST, we can
easily maintain such a repository, holding each resource with its properties and relationships, as well as metadata such as
keywords, the source page, the named graph(s), the date it was fetched/created, etc.
Once all the scraping, fetching and posting has been completed, we can present the user with a table of all triples, partitioned (maybe with tabs)
by the named graph URI, and organized according to the Table View defined in the requirements list. Many approaches exist for presentation here,
but the bottom line is that we will give the user the chance to save all or part of the discovered RDF. The user can tag the saved named
graphs with keywords, either individually or as a whole.
- Once the user has selected the named graphs to save, and has ascribed optional keywords to them, we will call a special service that will
receive this information, and add the named graphs, by URI, to the user's collection in Boca. The service will also add the keywords to the index.
The service will be a simple HTTP POST that contains a plain list of named graphs and tags:
ng1 tag1 tag2 tag3
ng2 tag4 tag5 tag6
The request will be received by an HTTP Servlet that will interact through the Boca named graph API to
1) add the named graph to the user's collection of saved graphs
2) add the keywords to the keyword named graph, global for all users.
The graph in 2 will be the authoritative and permanent store for keywords, but will not do for fast lookups (see below).
- Now, we can always maintain the map of nguri->triples to facilitate querying later on, but it isn't clear how this cache will be useful.
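As a rough illustration of the plain-text save request described above (one line per named graph, followed by its tags), the sketch below formats and parses such a body. The function names and the example graph URIs are hypothetical, and it assumes that neither graph URIs nor tags contain whitespace.

```javascript
// Hypothetical helper: build the save-request body, one line per named graph,
// with the graph URI first and its space-separated tags after it.
function formatSaveRequest(selections) {
  return selections
    .map((sel) => [sel.graph].concat(sel.tags).join(" "))
    .join("\n");
}

// Hypothetical inverse, roughly what the receiving servlet would have to do.
function parseSaveRequest(body) {
  return body
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => {
      const parts = line.trim().split(/\s+/);
      return { graph: parts[0], tags: parts.slice(1) };
    });
}

// Example graph URIs are made up for illustration.
const body = formatSaveRequest([
  { graph: "urn:qid:nala.gbif.org:entries:1", tags: ["parrot", "caique"] },
  { graph: "urn:qid:nala.gbif.org:entries:2", tags: [] }
]);
const parsed = parseSaveRequest(body);
console.log(body);
```

A graph saved without tags simply appears as a line containing only its URI.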
Section 1.2 - Data retrieval techniques
Before proceeding further, it will be useful to discuss the various approaches we will use for getting data out of the system. These
will be used heavily by the functionality described in the sections that follow.
- APP Named Graph requests
We can easily retrieve the contents of a particular named graph in RDF/XML. In many cases we will use this for showing
simple Table or Single views. However, in general, it is difficult to manipulate RDF on the client as there is limited query
support and performance is not great. One advantage of this data access method is that the APP endpoint employs some sophisticated
caching mechanisms.
- SPARQL Queries
This will be our prevailing approach to fetching data from the server. SPARQL queries can be issued against a single named graph,
a collection of named graphs, or all named graphs together. The results can be returned in a number of formats, the most convenient
of which, for our purposes, is JavaScript Object Notation, or JSON. JSON can be directly evaluated into JS objects and easily traversed
to display the information. In addition, libraries such as Exhibit and Dojo make the display of JSON data quite easy.
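To make the traversal of JSON results concrete, here is a minimal sketch that flattens a result set into plain row objects. It assumes the server returns the W3C SPARQL JSON results layout (head.vars plus results.bindings); the function name and the sample data are illustrative only, and whether Queso emits exactly this layout is an assumption.

```javascript
// Flatten SPARQL JSON results into one plain object per solution.
// Assumes the standard layout: { head: { vars }, results: { bindings } }.
function bindingsToRows(sparqlJson) {
  const vars = sparqlJson.head.vars;
  return sparqlJson.results.bindings.map((binding) => {
    const row = {};
    for (const name of vars) {
      // Unbound variables are simply absent from a binding.
      row[name] = name in binding ? binding[name].value : null;
    }
    return row;
  });
}

// Sample data, made up for illustration.
const sample = {
  head: { vars: ["person", "name"] },
  results: {
    bindings: [
      { person: { type: "uri", value: "urn:bszekely" },
        name: { type: "literal", value: "Ben Szekely" } },
      { person: { type: "uri", value: "urn:example:anon" } }
    ]
  }
};
const rows = bindingsToRows(sample);
console.log(rows);
```

In a real client the response text would first go through JSON.parse, which is generally a safer route than evaluating it directly.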
Section 1.3 - Working Set management
** After thinking it over, we believe that the working set was intended to be what we below refer to as the Client Repository,
something that we are building up over searches and queries in the memory of the browser, not a specific set of named graphs that needs to be maintained over time.
We leave the design of the Working Set Manager here in case we want to implement it later, but I believe that it is not necessary after
further examination of the initial requirements **
** We will need a similar system that can be used to manage saved named graphs. We will retain the parts of this design that
pertain only to saved named graphs **
In this section we discuss how the user manages his Working Set. First we discuss the difference between his Working Set
of named graphs, and his Saved Set of saved named graphs. The Saved Set is simply the set of graphs that the user wishes
to remain in the system and not get deleted during maintenance (see below). The Working Set is the set of graphs the user
is particularly interested in at the moment, i.e. the ones being used to solve a particular task. In fact, it may make sense
to allow multiple working sets of graphs.
- From the toolbar or other sensible location, the user selects "Manage Working Sets".
- The Working Set Manager (WSM) will be an HTML page inside the plugin that has several column views.
- The leftmost view will be a list of all Working Set names. The data will be retrieved by SPARQL.
The user will be able to add or delete working sets from this column. These operations will immediately
go to the server and perform the change.
- When a working set is selected in the left column, each named graph in the set will be displayed in the next column
over, again using SPARQL. Instead of showing the URI, we will display a human-readable title from the Atom Entry. The SPARQL
query can easily pick out this title from the named graph. The user can select a group of graphs to "unsave" if he
deems that he will never be interested in them again.
- The third column will contain a list of named graphs in the user's saved set that are candidates for the working set. When we
fetch these from the server using SPARQL, we can join with the keyword index and bring down all keywords for each named graph
in the saved set. The user will then very quickly be able to refine this list via incremental keyword search.
- Once the user is satisfied with the state of the current working set, he can push the changes to the server using a special
request similar to the one that occurs when the user saves browsed RDF.
- The user can also choose to export the data to tab-separated or CSV format. It is unclear whether that should be a function
of the WSM or a function of the table view, but either way, some sort of shortcut will be provided here.
- Now it might be tempting here to add additional functionality to the WSM such as finding new named graphs to add to the saved set
and then maybe the working set, but this would be overloading the WSM, and encroaching on the jobs of the other bits of the
application.
Section 1.4 - Data query and viewing
The details of how data is searched, queried, browsed and viewed will most likely evolve as the application is developed. The design
and functionality described here is merely our starting approach, and a first attempt to meet the outlined requirements.
- The views of the data fundamentally provide a view of many resources and their properties and relationships. Internally, the data is
stored in the Client Repository as JavaScript objects: a resource, with its properties and relationships, as well as metadata such as
keywords, the source page, the named graph(s), the date it was fetched/created, etc.
All views (Tree, Table, Map and Link) are fundamentally views of this core data structure.
** After further evaluation, we feel that the Client Repository fulfills the requirements of the Current Set better than the
WSM system designed in 1.3
Tree View - The top level of the tree represents all facets on which we want to browse the data. For now, these are hard-coded
facets such as source page, scraper, date, keywords. When the tree view loads, it builds itself at runtime based on the
current state of the Client Repository. When a branch is opened, such as keywords, the next level allows us to show
all resources found so far, and the number of them, or further restrict the set as at the root. This continues until the user is
satisfied with the restriction. When a resource is selected, we
can move to the single view, which shows all the available information about the resource as well as where it came from.
Alternatively, the whole set of resources can be selected and shown in table view. This whole process is equivalent
to showing a Table view, refined using the search bar at the top with keywords, source, etc...
Table View - A table view shows a table of resources, one row per resource. Each column represents a property. If a resource has
no value for that property then the cell is blank. At the top of the table will be a method for the user to
refine the contents of the table by keyword, source, date, or other metadata. He can also add or remove columns from
the table.
Link View - All resources in the current Client Repository will be displayed as nodes in a graph, and with labeled edges between them.
Map View - All resources in the current Client Repository that have geographic information will be displayed on the map. This view must
be extensible so that different types of geographical information can be extracted i.e. different ontologies for long-lat coords,
city-state-country representations, etc...
- The Client Repository, a fancy name for the JavaScript object structure, must be carefully designed to allow what we want.
- While the data on the server is stored by named graph so that data provenance is fully maintained, on the client the requirements of the application
dictate that the Resource is king, i.e. the subject of each triple. So the top-level object will be an associative array keyed by
Resource URI.
- Now we have a decision to make: do we want keywords, named graph, source and other metadata associated with each triple or with
each resource? The former is clearly much more expensive, but the latter will allow the user to know only the set of keywords associated
with a resource, or the set of named graphs and web pages that the resource is associated with, not those of each statement. We'll go with the relaxed
approach, assuming that it will be sufficient. After all, the complete history of every statement and named graph lies in Boca and can be
specially queried with SPARQL if necessary. Thus the lists of keywords, dates (we'll do some clever aggregation here), named graphs, source
pages, etc. will each get an array hanging off the main object.
- Next, we'll have an array of statements. There are a number of ways to do this. One approach will be to have an associative array
whose key is the full URI of the predicate, and whose value is an array of the objects. We'll start with this approach and modify
it as we go forward.
- The CR is a convenient lowest-common-denominator (LCD) approach to providing a data representation to each of our views, as well as a way to temporarily display search
results and captured RDF.
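The structure described in the bullets above could look roughly like the following sketch: a map keyed by resource URI, per-resource metadata arrays, and a predicate-to-objects map for statements. All field and function names here are hypothetical, and the sketch ignores the distinction between literal and URI objects.

```javascript
// Hypothetical Client Repository: an associative array keyed by resource URI.
// Object.create(null) avoids accidental clashes with built-in property names.
function createRepository() {
  return Object.create(null);
}

// Append value to list unless it is missing or already present.
function addUnique(list, value) {
  if (value !== undefined && !list.includes(value)) list.push(value);
}

// Record one statement about a subject; metadata is kept per resource,
// not per statement (the "relaxed" approach).
function addStatement(repo, subject, predicate, object, meta = {}) {
  const res = repo[subject] || (repo[subject] = {
    uri: subject,
    keywords: [],
    namedGraphs: [],
    sourcePages: [],
    statements: Object.create(null) // predicate URI -> array of objects
  });
  const objects = res.statements[predicate] || (res.statements[predicate] = []);
  addUnique(objects, object);
  addUnique(res.namedGraphs, meta.namedGraph);
  addUnique(res.sourcePages, meta.sourcePage);
  (meta.keywords || []).forEach((k) => addUnique(res.keywords, k));
  return res;
}

const repo = createRepository();
const FOAF_NAME = "http://xmlns.com/foaf/0.1/name";
addStatement(repo, "urn:bszekely", FOAF_NAME, "Ben Szekely", {
  namedGraph: "urn:qid:test.boca.adtech.ibm.com:entries:2050832554",
  sourcePage: "http://www.klinewoods.com/",
  keywords: ["foaf"]
});
console.log(JSON.stringify(repo, null, 2));
```

Because the whole structure is plain objects and arrays, it can be serialized with JSON.stringify, which fits the stated goal of eventually saving the CR to disk as JSON.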
- Populating the client repository
- Sometimes we'll use a CR data structure for one-off jobs, such as during data collection when we want to present to the user a table view of all the data
scraped from the current page. In this case, we'll just build a client repository, view it, and then destroy it when we are done
(though not before possibly caching the data in the main CR).
Other times, we'll want to keep the client repository around during a user's session, maybe even serialize it to disk to be retrieved
when they come back online. This will be a bonus feature if there is time. However, this use of the CR is what the requirements list
refers to as the Current Set.
- The first technique will be from a complete named graph in RDF/XML via APP. Here we'll use the 3rd party RDF parser to get all the triples,
but then iterate through and put them in our store, taking note of keywords etc. as they are available, if at all.
- The second technique is through SPARQL queries. Here, depending on what the user supplies, we may not get keywords or other metadata
unless we either query for it after, or force the user to query for it. We will write a very simple routine that takes SPARQL results
in JSON and converts them to CR format. We can then offer the user the option of filling in the missing provenance information through
extra queries which we will generate automatically for them.
- Search and getting new data
So now the user has just browsed a bunch of pages, and has quite a bit of RDF saved in Boca. Depending on how we implement it,
they may already have some cached stuff in the main CR as well. There are a number of ways for the Scientist in our scenario to do his search.
No matter what the search, the scientist may choose whether or not he is searching the whole datastore or only his saved data.
The search area will be an HTML page that has different tabs or modes that allow for the different types of searching.
Once a search has been completed, the user might not want all statements immediately placed in the CR. In practice, we'll most likely
build up a temporary CR (like in data collection), show it to the user via Table View and let them select which resources go into the main CR. This
process will happen exactly once per search. The selection process can involve keyword or other search on the table view of the
temp CR
After this final step, the user can do another search, or select a view of the main CR to browse the data, and possibly export.
- Keyword Search: The global keyword index on the server can very quickly return all the named graphs for a particular keyword. Depending
on the scope restriction, we'll have to do a quick SPARQL query on the server to remove the unnecessary graphs, and then return them to the
client with all the information needed to populate a temporary CR. If the user issues multiple keywords, we'll let the user
specify if it is an AND or an OR, but nothing complex like "k1 AND (k2 OR k3)". One approach that we'll try initially here is that
the keyword search will just return the named graphs, and then the client will issue APP requests for the RDF itself. This will
enable us to leverage the caching functionality of Queso.
- Text search: Boca has a built-in Lucene-based text indexer, but its performance and reliability are questionable. It works just like the keyword index
we'll build: you give it a search string, and it returns matching named graphs. The indexer indexes all literal values. If we
have time, we'll try to leverage this search feature for the application.
- SPARQL Search: Using the SPARQL endpoint, we can issue queries from the client to the server, getting the results back in JSON. Depending
on the query the user issues, we may be able to issue additional auto-generated queries to fill in keywords, source pages, scraper data,
etc. The user may not need this information anyhow. After the SPARQL query has been issued, we'll transform the JSON and insert the
resources and statements in the temporary CR. We may have to restrict the type of SPARQL queries we allow the user to issue in order
that they fit nicely into the CR data model. Alternatively, we can allow a loose SPARQL query and, when the data comes back, allow
the user to perform a Resource search or further explore the data. We will refine this design as we move forward with implementation.
- Resource search: if a user encounters a particularly useful resource, he can issue a request to the server to find all
named graphs that mention that resource. This is just a special case of a SPARQL query, but it deserves special treatment in the UI.
The results of this search will be shown in a single view.
* New named graphs encountered via search may be added to the user's list of saved graphs.
Section 1.5 - Scraper registration and storage
As discussed in 1.1, we will have two different types of scrapers: those that search for links to RDF, and those that extract RDF triples
themselves. There are numerous security risks involved with allowing people to register JS code that can run. However, there are ways
to properly sandbox the code, which we will make use of.
- Each scraper implementation can depend on certain objects that it can access in order to do its work.
- The code for each scraper will be stored as an Atom Entry in Queso/Boca. When the application loads, we can easily
find all scrapers using SPARQL queries.
- Scrapers can have metadata associated with them that helps determine which websites they should be applied to.
- We'll have a simple page for registering scrapers and loading them into Queso.
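One way the scraper metadata might drive selection is sketched below. The record shape (id, kind, urlPattern) is purely hypothetical, since the design does not yet fix what the metadata looks like, and the sketch deliberately leaves out sandboxing and the loading of scraper code from Queso.

```javascript
// Hypothetical scraper records; in the design these would be loaded from Queso.
// kind is "link" (finds links to RDF) or "triple" (produces triples directly).
const scrapers = [
  { id: "rdf-link-finder", kind: "link", urlPattern: /^https?:\/\// },
  { id: "klinewoods-foaf", kind: "triple", urlPattern: /^http:\/\/www\.klinewoods\.com\// }
];

// Choose the scrapers whose URL pattern matches the page, grouped by kind so
// that link scrapers can run before triple scrapers.
function selectScrapers(registry, pageUrl) {
  const matching = registry.filter((s) => s.urlPattern.test(pageUrl));
  return {
    link: matching.filter((s) => s.kind === "link"),
    triple: matching.filter((s) => s.kind === "triple")
  };
}

const chosen = selectScrapers(scrapers, "http://www.klinewoods.com/");
console.log(chosen.link.map((s) => s.id), chosen.triple.map((s) => s.id));
```

The patterns are written without the global flag so that repeated calls to test() carry no state between pages.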
Section 1.6 - Data export
Data export should be a straightforward operation atop the Client Repository. In the first cut of the application we will implement
export atop the Table View. The user will be allowed to select all or a subset of the rows in the table and choose to export them in one
of the available formats. We will allow comma-separated values (CSV) and tab-separated values (TSV) to start with. The user can also
specify whether they want all predicates exported or just the selected ones. In general, exporting all predicates will yield a very sparse
export.
- Given an array of resources and predicates, we can easily generate the proper output.
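A minimal sketch of that export step follows. It assumes each resource is an object with a uri and a statements map from predicate URI to an array of values (the associative-array layout proposed in 1.4); the function names, the choice to join multiple values with "; ", and the quoting rules (borrowed from common CSV practice) are assumptions rather than settled design.

```javascript
// Quote a field if it contains the delimiter, a double quote, or a line break.
function escapeField(value, delimiter) {
  const text = String(value);
  if (text.includes(delimiter) || /["\r\n]/.test(text)) {
    return '"' + text.replace(/"/g, '""') + '"';
  }
  return text;
}

// One row per resource, one column per predicate; missing values stay blank.
function exportTable(resources, predicates, delimiter = ",") {
  const header = ["resource"].concat(predicates);
  const rows = resources.map((res) =>
    [res.uri].concat(predicates.map((p) =>
      // Multiple values for one predicate are joined into a single cell.
      (res.statements[p] || []).join("; "))));
  return [header].concat(rows)
    .map((cells) => cells.map((c) => escapeField(c, delimiter)).join(delimiter))
    .join("\n");
}

const NAME = "http://xmlns.com/foaf/0.1/name";
const HOMEPAGE = "http://xmlns.com/foaf/0.1/homepage";
// Sample resources, made up for illustration.
const resources = [
  { uri: "urn:bszekely",
    statements: { [NAME]: ["Szekely, Ben"], [HOMEPAGE]: ["http://www.klinewoods.com/"] } },
  { uri: "urn:nala", statements: { [NAME]: ["Nala"] } }
];
const csv = exportTable(resources, [NAME, HOMEPAGE]);
const tsv = exportTable(resources, [NAME, HOMEPAGE], "\t");
console.log(csv);
```

Passing only the selected predicates gives the denser export; passing every predicate in the repository gives the sparse one mentioned above.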
Section 1.7 - Keyword-named graph indexing
In data storage systems where full text search is difficult to implement or not performant, keyword search is often employed
to provide a quick alternative. To support this in our application, we will build keyword indexing as an auxiliary service atop Queso
and Boca, much like the saved-graphs service defined in section 1.1. There is a slight difference in that the keyword indexing service
might be something that belongs in Boca/Queso and is not specific to our application.
- Keywords will be permanently stored in a global named graph such as urn:qid:nala.gbif.org:system:keywords. We don't spread the keywords
over multiple named graphs because of the overhead of reading through all of them while building indexes. We have a couple of different
choices for how this named graph is arranged:
(1) Flat: a simple flat list of triples of the form <nguri> <queso:keyword> "keyword1"
(2) Per NG: <nguri> <queso:keywords> <_bnode:1>
<_bnode:1> <queso:keyword> "keyword1"
<_bnode:1> <queso:keyword> "keyword2"
(1) allows us to issue the query for "all NGs for a keyword" or "all keywords for a named graph" without
performing any joins. Indexes should help with both of these queries.
(2) requires a join through the blank node, so queries for all NGs for a keyword will be expensive.
Also, we will not be querying this graph directly when a user issues a keyword search. This graph will be used mostly to
build indexes.
We proceed with choice (1).
- It's very easy to build storage-efficient in-memory indexes for keyword searches. We'll actually take a two-index approach:
Type-ahead search index (TASI): As the user types keywords, we want to help auto-complete them. To facilitate this, we simply keep
a sorted array of all the keywords on the server and, as the user types a character, we send the word typed so far (the prefix) to the
server via Ajax. We then do two binary searches on the array to find the range of keywords that match the prefix and return them.
Named-graph index (NGI): Once the user has chosen the keywords, we want to be able to quickly find the corresponding named graphs. The reason
for this is that SPARQL queries across the whole database are very slow, so if we can a priori narrow down the set of named graphs
for the query, we can reduce the scope and much more quickly return the content from those graphs.
The difficult bit here is how to update these indexes as more named graphs are tagged with keywords. It basically boils down
to inserting values into a sorted array. If the number of keywords isn't huge, this shouldn't be a problem. For the demo
we'll simply add the new values to the end of the array and call Arrays.sort(..), which is O(n log n), and see what happens. In practice, we'll probably want
to copy the array before insertion so we don't foul up queries that happen during the sort.
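The two binary searches of the type-ahead index can be sketched as follows. This is written in JavaScript purely for illustration; the design places the index on the server, where Arrays.sort suggests Java, and all function names are hypothetical. It relies on the fact that, in a sorted array, all strings sharing a prefix form one contiguous run.

```javascript
// Binary search for the first index >= from at which test() becomes true.
// Requires test() to be false for a leading run and true for the rest.
function firstIndexWhere(sorted, from, test) {
  let lo = from;
  let hi = sorted.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (test(sorted[mid])) hi = mid; else lo = mid + 1;
  }
  return lo;
}

// Return every keyword starting with prefix, using two binary searches.
function prefixMatches(sortedKeywords, prefix) {
  // 1st search: start of the run, the first keyword not less than the prefix.
  const start = firstIndexWhere(sortedKeywords, 0, (k) => k >= prefix);
  // 2nd search: end of the run, the first keyword that no longer has the prefix.
  const end = firstIndexWhere(sortedKeywords, start, (k) => !k.startsWith(prefix));
  return sortedKeywords.slice(start, end);
}

// Copy-then-sort update, so that readers of the old array are never disturbed.
function withKeywords(sortedKeywords, newKeywords) {
  const next = Array.from(new Set(sortedKeywords.concat(newKeywords)));
  next.sort(); // default sort compares strings by UTF-16 code units
  return next;
}

let keywords = withKeywords([], ["parrot", "caique", "rdf", "sparql"]);
keywords = withKeywords(keywords, ["parakeet"]);
const matches = prefixMatches(keywords, "par");
console.log(matches);
```

The comparison in the first search uses the same code-unit ordering as the default sort, which is what keeps the two consistent.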
- To summarize, the steps of keyword search are as follows:
- In the search bar, the user types multiple words. As the user types each word, the application queries the TASI, helping the
user discover keywords that will yield any results at all.
- The application then sends the keywords to Queso, which uses the NGI to find all the named graphs for each keyword and computes
the intersection (AND) or union (OR). It then returns just this list of NGs to the client.
- The application then issues named graph requests over APP for each of the named graphs, parses the response using the 3rd party
parser, and inserts the triples into the temporary CR for review by the user.
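The named-graph lookup in the second step might be sketched like this: an index built from flat (named graph, keyword) pairs, then an intersection or union over the per-keyword sets. The names are hypothetical and the code is JavaScript for illustration only, since in the design this step runs on the server.

```javascript
// Build keyword -> set of named graph URIs from flat [graph, keyword] pairs,
// mirroring the flat triple layout chosen for the keyword graph.
function buildGraphIndex(pairs) {
  const index = new Map();
  for (const [graph, keyword] of pairs) {
    if (!index.has(keyword)) index.set(keyword, new Set());
    index.get(keyword).add(graph);
  }
  return index;
}

// mode "AND" intersects the per-keyword sets; anything else unions them.
function findGraphs(index, keywords, mode) {
  const sets = keywords.map((k) => index.get(k) || new Set());
  if (sets.length === 0) return [];
  const result = new Set(sets[0]);
  for (const set of sets.slice(1)) {
    if (mode === "AND") {
      for (const g of result) if (!set.has(g)) result.delete(g);
    } else {
      for (const g of set) result.add(g);
    }
  }
  return Array.from(result).sort();
}

// Sample pairs, made up for illustration.
const ngIndex = buildGraphIndex([
  ["ng1", "parrot"], ["ng1", "caique"], ["ng2", "parrot"], ["ng3", "rdf"]
]);
console.log(findGraphs(ngIndex, ["parrot", "caique"], "AND"));
console.log(findGraphs(ngIndex, ["parrot", "caique"], "OR"));
```

A keyword with no entry in the index contributes an empty set, so it empties an AND query and leaves an OR query unchanged.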
Section 2 - Extended Discussion
In this section we discuss some design and technical issues that didn't make sense to delve into in Section 1. Some of the discussion here
raises issues that we will not necessarily deal with in the demo version of the application, but that should be at least brought up to be
addressed in future versions. Most of these issues come up in various places in the application and affect the design of the system all the way
through the stack.
Section 2.1 - Caching
Caching will be very important to the scalability and performance of our application. Data may be cached both in memory on the Web server and
in the browser extension. Some caching is provided for us by existing components and some arises inherently from the behavior of the system.
At this point, we do not see the need to build additional caching mechanisms.
- client-side caching: The Client Repository is our client-side cache. All of the views in the application operate on this local data
instead of going to the server for each view. This inherent caching mechanism is crucial to performance.
- server-side caching:
- Queso cache: All requests to Queso/Boca via Atom Publishing Protocol (APP) are automatically subject to caching. Since much of the
data will be brought down to the application via APP, we will benefit automatically from its caching mechanism. An interesting property
of our system is that named graphs are write-once, read-only. This is not a restriction of Queso or Boca, but merely a consequence of
how our application uses them. Every bit of scraped data gets its own named graph. As a result, the data can be cached indefinitely.
- keyword search: This is in fact just a cache of the special-case query for keywords.
Section 2.2 - named graphs/duplicate data and ids
Every application that makes use of a named-graph based triple store must decide how to partition the data across graphs. In our application,
each set of triples accumulated by a particular scraper or triples fetched from an RDF link receives its own named graph. This makes updating
the data store very simple: find some triples, create a named graph for them. It is not without its problems, however. Right now, if a page
is scraped and saved repeatedly, we'll have duplicate data for the page in the system, possibly within the saved graphs of a single user. Because
we store all the provenance information for each graph (the URL, the scraper used, the date, etc.), we can use SPARQL ASK queries to determine if
a page should be rescraped. These queries can be fairly expensive because they must be posed on the entire datastore. We now discuss a few
alternative named graph data models, and why they were not selected.
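To make the ASK idea concrete, here is a hedged sketch of how the client might build such a query. The two provenance predicates under http://example.org/nala# are invented for the example, and the sketch assumes the provenance triples live inside the graph they describe, with the graph URI as subject. The real vocabulary and layout depend on how the provenance is actually stored.

```javascript
// Hypothetical provenance predicates (not part of any real vocabulary).
const SOURCE_URL = "http://example.org/nala#sourceUrl";
const SCRAPER = "http://example.org/nala#scraper";

// Wrap a value as a SPARQL IRI reference after a basic check for
// characters that may not appear between the angle brackets.
function iri(value) {
  if (/[<>"{}|^`\\\s]/.test(value)) {
    throw new Error("not a valid IRI: " + value);
  }
  return "<" + value + ">";
}

// ASK whether some named graph already records this page and scraper.
function buildAlreadyScrapedAsk(pageUrl, scraperUri) {
  return [
    "ASK {",
    "  GRAPH ?g {",
    "    ?g " + iri(SOURCE_URL) + " " + iri(pageUrl) + " ;",
    "       " + iri(SCRAPER) + " " + iri(scraperUri) + " .",
    "  }",
    "}",
  ].join("\n");
}
```

A true answer would mean the page has already been scraped with that scraper. Deciding whether the existing copy is too old would need an extra pattern on the stored date, which is left out here.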
One alternative to this approach was to have each resource get its own named graph. Then, adding duplicate data would be a no-op. The problem
here is that we lose all provenance about where data came from. In particular, if data about a particular resource is conflicting, sticking it
all in one graph could create false information. Solving this knowledge consistency problem is beyond the scope of the application.
Another approach would be to have a named graph per web page. This works OK for static web pages. As the web page changes, we would eventually
rescrape it and then we'd just get a new version of the named graph. However, data from different scrapers may create conflicting triples. This
information needs to be separated out. The picture gets even more complicated when we have dynamic web pages, or Javascript-based web pages.
When all is said and done, the safest approach is just to have each self-contained piece of RDF reside in its own named graph. We can handle
duplicate data by not rescraping when we don't have to, and by handling duplicates at the CR level on the client when we display the information
to the user. If triple explosion becomes a problem, we can re-evaluate our approach or engineer new ways to avoid duplicate graphs.
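Handling duplicates at the CR level can be as simple as keeping the first copy of each statement when graphs are loaded. A minimal sketch, assuming triples are plain objects with subject, predicate and object string fields (a hypothetical shape, not the actual CR design):

```javascript
// Keep the first occurrence of each (subject, predicate, object) statement.
function dedupeTriples(triples) {
  const seen = new Set();
  const unique = [];
  for (const triple of triples) {
    // JSON.stringify of an array gives an unambiguous key for the three parts.
    const key = JSON.stringify([triple.subject, triple.predicate, triple.object]);
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(triple);
    }
  }
  return unique;
}
```

This sketch discards the information about which named graph each copy came from. If the views need to show provenance, the CR would have to keep a list of source graphs per statement instead.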
A somewhat convenient corollary to this is that the URIs for named graphs can be anything we like. We can simply allow the Queso system
to generate named graph URIs for us, and the application will just treat them opaquely.
Section 2.3 - Security
When we speak about security, we don't mean making the web server hack-proof or worrying about malicious scripts running in the browser. These
are all important issues, but not of immediate concern. What we really intend to discuss here is the security model for our application.
As a demo application, it's OK for all named graphs and keywords to be public information. However, in a real deployment this might not
be the case. If people are browsing private websites, they might not want to upload private RDF into a public database. Furthermore, keywords
might themselves be private.
One possible security model would be that each named graph in the system is created under the user of the application. That user can then,
through Boca/Queso management tools, decide to give permission to other people to see their graphs. The problem with this is that under normal
operation (which is what the vast majority of users will stick to), most data would be partitioned by user, reducing the collaborative benefit of the system.
Section 2.4 - maintenance
Because we retain all RDF that we browse, things could potentially start to get out of hand quickly. We would really like to have a mechanism
for removing data that isn't saved after some period of time. Unfortunately, Boca does not have a way to delete named graphs. We can clear the
contents of the named graph, but the triples will still exist in the history tables. We will leave our design as is for now, but we may need to
either modify Boca (not preferable), or revise our design so that only the explicitly saved named graphs are stored in the database.
Section 2.5 - inference and cross-mapping types
Inference is one of those tricky features of a Semantic Web system that is difficult to get working right. We see two basic approaches to
inference in our system. First, we can rely on Boca to infer triples based on owl:sameAs. I believe it does this, but my buddy who implemented
that is away now. When he gets back I can ask him how it works. I believe it should provide the rudimentary behavior outlined in the requirements.
The second approach to inference would be to do something simple in the Client Repository to entail extra triples in the views. The benefit of this
approach is that the rules could be applied per user. The rules themselves can be stored as Atom Entries in Queso, and thus shared, but application
of the rules would be per user, as is *not* the case in Boca.
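As an illustration of the second approach, the sketch below handles only owl:sameAs: it groups resources linked by owl:sameAs and rewrites every statement to use one representative per group, so a view sees the merged description. This is just one simple way to get the effect, not a description of what Boca does. The triple shape (subject, predicate and object string fields) and the function name are assumptions, and literals are not distinguished from URIs.

```javascript
const SAME_AS = "http://www.w3.org/2002/07/owl#sameAs";

// Merge resources connected by owl:sameAs and drop the sameAs statements.
function mergeSameAs(triples) {
  const parent = new Map();
  // Union-find lookup: follow parent links to the group representative.
  function find(node) {
    if (!parent.has(node)) {
      parent.set(node, node);
    }
    let root = node;
    while (parent.get(root) !== root) {
      root = parent.get(root);
    }
    parent.set(node, root); // shorten the path for later lookups
    return root;
  }
  for (const triple of triples) {
    if (triple.predicate === SAME_AS) {
      const a = find(triple.subject);
      const b = find(triple.object);
      if (a !== b) {
        // The smaller string becomes the representative, so the result
        // does not depend on the order of the sameAs statements.
        parent.set(a < b ? b : a, a < b ? a : b);
      }
    }
  }
  return triples
    .filter((triple) => triple.predicate !== SAME_AS)
    .map((triple) => ({
      subject: find(triple.subject),
      predicate: triple.predicate,
      object: find(triple.object),
    }));
}
```

Because this runs on the client, each user could apply a different set of sameAs statements to the same underlying data.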
Section 2.6 - LSIDs
Thus far we have identified one place where we can make use of LSIDs in the application. We can pick out LSIDs when we are scraping and
attempt to resolve their metadata using a web resolver.
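Picking out LSIDs during scraping could be done with a pattern match over the page text. The sketch below uses a deliberately simplified pattern for the urn:lsid:authority:namespace:object[:revision] form. It is an assumption-laden approximation of the LSID syntax, not a full validator, and the function name is hypothetical.

```javascript
// One LSID component: letters, digits, "_" and "-", with "." allowed only
// between such runs, so a sentence-ending period is not swallowed.
const PART = "[A-Za-z0-9_-]+(?:\\.[A-Za-z0-9_-]+)*";
const LSID_PATTERN = new RegExp(
  "urn:lsid:" + PART + ":" + PART + ":" + PART + "(?::" + PART + ")?",
  "gi"
);

// Return the distinct LSIDs found in text, in order of first appearance.
function extractLsids(text) {
  return Array.from(new Set(text.match(LSID_PATTERN) || []));
}
```

Each LSID found this way would then be handed to the web resolver to fetch its metadata.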
As we pointed out earlier, named graphs, as they are accessed by the application, are write-once, read-only entities. Thus, we could use LSIDs
as named graph URIs in our system, and add an LSID resolver atop Boca/Queso. The main challenge would be to figure out extra triples to return
to present a meaningful web of LSIDs to browse.
Section 3 - Technical components and libraries
Here is a condensed list of some of the required technical components. It's more of a summary of the previous section.
- RDF Store and REST API
- Boca
- Queso
- Atom Javascript library
- generate and parse valid Atom entries
- Atom Publishing Protocol client library
- Javascript RDF library
- To parse RDFXML and collect pure RDF we'll use the 3rd party library available at
http://www.jibbering.com/rdf-parser/ for some
- RDF-XML serializer
- Client Repository
- load from RDF triples
- load from json
- Sparql library to issue queries against Queso/Boca
- sparql.js
- Data export mechanism
- Query interface
- keyword
- sparql
- resource query (in views)
- Views
- table
- tree
- single
- REST services
- graph save/keyword tagging service
- keyword type-ahead service
- keyword search service
Section 4 - Development plan
Of the remaining 5.5 weeks, or 220 work hours in the contract, I will be working full time for about 3 weeks, and then part time to finish it off.
I'm still splitting the work into 40 hour increments though they may not fall on boundaries of actual weeks.
My goal is to submit the application for final review by the end of August. The timetable below is a rough guide for completing the various parts
of the application. The purpose is really to let me know if I'm spending too much time on one thing, but I'm not going to view the various
milestones as hard deadlines. I believe the overall goals of the project are quite ambitious given the time frame, but I think they are
doable. If time becomes an issue, I will alter the goals so that at least a basic prototype allowing us to demonstrate the usage scenario is
ready.
Week 1 (hours 0-40)
- Data collection and screen scraping completed
- basic web page scraping
- follow RDF and LSID links
- scrapers stored and registered in Queso
- simple UI for selecting scrapers and adding new ones
- CR design finalization and implementation
- load Data from parsed RDF/XML
Week 2 (hours 40-80)
- notify user the data has been scraped
- implement the table and single view to let user browse scraped RDF
- augment the table view so that user can save/tag graphs in view
- implement the service that receives saved NGs and keywords and
inserts them into the proper named graphs
- implement the keyword search services
- type-ahead keyword service
- NG retrieval service
Week 3 (hours 80-120)
- design and implement the overall search framework and UI
- get keyword search working
- call the keyword typeahead search
- call the NG index service
- retrieve all the NGs over APP
- load them into the temporary CR
- display in table view
- allow user to include various rows in main CR
- get sparql search working
- figure out how to refine the search or leverage
a secondary Resource search to get statements in JSON
- implement the JSON -> CR transform
- show the results to the user, and allow them to add to main CR
Week 4 (hours 120-160)
- implement overall main viewing framework with tabs, etc...
- make sure table view works on main CR
- view the main CR, and explore using keywords etc...
- select a resource and show in single view
- table view
- tree view
- arrange resource in a tree
- each child of a node represents a different axis of refinement.
- get basic example working with a couple different axes of refinement
Week 5 (hours 160-200)
- Data export
- Saved graph management
- named graph maintenance
- inference
Week 5.5 (hours 200-220)
- finish unfinished tasks
- Extra scrapers
- RDFa scraper
- Griddl scraper
- NCBI scraper
- Extra views
- link view
- map view
- full text indexing in Boca
@