wiki-archive/twiki/data/TAPIR/PyWrapperAlgorithm.txt,v

742 lines
19 KiB
Plaintext

head 1.23;
access;
symbols;
locks; strict;
comment @# @;
1.23
date 2007.01.09.00.00.00; author MoinMoin; state Exp;
branches;
next 1.22;
1.22
date 2007.01.09.00.00.00; author MoinMoin; state Exp;
branches;
next 1.21;
1.21
date 2007.01.09.00.00.00; author MoinMoin; state Exp;
branches;
next 1.20;
1.20
date 2007.01.09.00.00.00; author MoinMoin; state Exp;
branches;
next 1.19;
1.19
date 2007.01.09.00.00.00; author MoinMoin; state Exp;
branches;
next 1.18;
1.18
date 2007.01.09.00.00.00; author MoinMoin; state Exp;
branches;
next 1.17;
1.17
date 2007.01.09.00.00.00; author MoinMoin; state Exp;
branches;
next 1.16;
1.16
date 2007.01.09.00.00.00; author MoinMoin; state Exp;
branches;
next 1.15;
1.15
date 2007.01.09.00.00.00; author MoinMoin; state Exp;
branches;
next 1.14;
1.14
date 2007.01.09.00.00.00; author MoinMoin; state Exp;
branches;
next 1.13;
1.13
date 2007.01.09.00.00.00; author MoinMoin; state Exp;
branches;
next 1.12;
1.12
date 2007.01.09.00.00.00; author MoinMoin; state Exp;
branches;
next 1.11;
1.11
date 2007.01.09.00.00.00; author MoinMoin; state Exp;
branches;
next 1.10;
1.10
date 2007.01.09.00.00.00; author MoinMoin; state Exp;
branches;
next 1.9;
1.9
date 2007.01.09.00.00.00; author MoinMoin; state Exp;
branches;
next 1.8;
1.8
date 2007.01.09.00.00.00; author MoinMoin; state Exp;
branches;
next 1.7;
1.7
date 2007.01.09.00.00.00; author MoinMoin; state Exp;
branches;
next 1.6;
1.6
date 2007.01.09.00.00.00; author MoinMoin; state Exp;
branches;
next 1.5;
1.5
date 2007.01.09.00.00.00; author MoinMoin; state Exp;
branches;
next 1.4;
1.4
date 2007.01.09.00.00.00; author MoinMoin; state Exp;
branches;
next 1.3;
1.3
date 2007.01.09.00.00.00; author MoinMoin; state Exp;
branches;
next 1.2;
1.2
date 2007.01.09.00.00.00; author MoinMoin; state Exp;
branches;
next 1.1;
1.1
date 2007.01.09.00.00.00; author MoinMoin; state Exp;
branches;
next ;
desc
@Initial revision
@
1.23
log
@Revision 23
@
text
@see [[#empty][below]] for the*empty repeatable region problem*
---+ PyWrapperAlgorithm
This algorithm is used by the PyWrapper to process a custom ResponseStructure for a CustomSearch.
Internally the PyWrapper uses [[http://www.biocase.org/dev/wrapper/CMF.shtml][CMF]] files with a very similar structure as the ResponseStructure definition to represent different ConceptualSchemas.
So the algorithm used for them and for a CustomSearch could probably be the same.
Imagine an example xml structure like this, with* marking repeatable elements and @@ marking xml attributes:
<verbatim>
A
B*
C*
D
E
F
G*
@@H
@@J
@@K
M
</verbatim>
For a given provider this structure is mapped to table fields as follows:
<verbatim>
A
B*
C*
D t1.fA
E t2.fB
F t1.fB
G*
@@H t3.fA
@@J t4.fA
@@K t1.fC
M t3.fB
</verbatim>
---+++++ The Algorithm
Determine for each RepeatableElement or "RepNode" the "locked" tables to loop over using their (compound) primary keys.
Top down approach:
<br>For each "RepeatableRegion" list all mapped tables that haven't been locked by a higher (parental) RepNode.
A RepeatableRegion are all elements between 2 RepNodes including the top RepNodes' mapped tables. The root region is the root element and all child elements until the first encountered RepNode.
For the above example this would be:
<verbatim>
root region
---------
A
M
rep region B
----------
B
F
rep region C
----------
C
D
E
rep region G
----------
G
@@H
@@J
@@K
</verbatim>
The tables involved are:
<verbatim>
root region
---------
A
M
-> t3
=> t3
rep region B
----------
B
F
-> t1
=> t1
rep region C
----------
C
D
E
-> t1, t2
=> t2
rep region G
----------
G
@@H
@@J
@@K
-> t1, t3, t4
=> t1, t4
</verbatim>
The root region is never repeated, as there should always be only a single root element.
Therefore t3 is not used in creating loops at all in this example.
<a name="empty"></a>
---+++++ The Empty Repeatable Region Problem
The only known problem so far is occuring when there is an empty rep-region above another rep region that contains mappings.
E.g.:
<verbatim>
A
B*
C*
D t1.fA
</verbatim>
In this case the algorithm suggests to use table t1 to loop at RepNode C to create multiple D elements:
<verbatim>
A
B
C
D1
C
D2
C
D3
</verbatim>
But this might be a wrong interpretation and actually RepNode B should have been looped over:
<verbatim>
A
B
C
D1
B
C
D2
B
C
D3
</verbatim>
A more real life example would be:
<verbatim>
DataSet
Specimen*
Identification*
FullName t1.attrA
</verbatim>
If we have a single table with each record representing a specimen, containing only one identification as a single field, the algorithm would interpret the different names of each record as multiple identifications for a single specimen. But remeber that this is only a problem because there is an empty RepeatableRegion "Specimen". If we add for example the catalog number to the structure, we would not face this problem anymore (unless a provider doesnt have a catalog number mapped) and the algorithm would use <Specimen> for its loop:
<verbatim>
DataSet
Specimen*
CatalogNumber t1.ID
Identification*
FullName t1.attrA
</verbatim>
On provider configuration level in BioCASE it is possible to manually set a "don't repeat" flag for a RepNode that does not make any sense for the specific provider's database structure.
In the example above one could tell the wrapper not to consider "Identification" as a RepNode.
In the case of the CustomSearch in the protocol, this cannot be defined as we are dealing with different providers with different database structures and dynamic requests.
We think there is a better solution by analyzing entire repeatable regions instead of sole concepts. A conceptual binding for a RepNode could also help in solving this problem. But even this method has its limits and some requests depending on missing mappings will have no results returned cause the wrapper has to return an error in such cases!
---+++++ Answers to the empty rep region problem
*The problem again:*
The main problem is when we face a structure definition, analyze the concept-bindings and local database mappings and then are confronted with an "empty repeatable region":
<verbatim>
A
B*
C*
D ->mapped to table1.attr1
E
</verbatim>
In this case B+E and C+D form repeatable regions. So if B is repeated, E needs also to be repeated.
Since there is no indication of a table belonging to this region, table1 can be used for looping over B or C.
And that is the problem. We have no idea about the meaning of B or C and the results returned might not make sense.
*Entity Mapping*
EntityMapping would solve the secondary empty repeatable regions described below.
*Primary empty repeatable regions: *
So if we face a request containing an empty RepRegion in the first place (before evaluating the conceptual bindings and their underlying local mappings) this is clearly an error and can be rejected.
*Secondary empty repeatable regions: *
But what happens if we receive a request like this:
<verbatim>
A
B*
C*
D -> abcd:concept1
E -> abcd:concept2
</verbatim>
... and one of the provider doesnt have the dwc:concept2 mapped.
What should we do? We could raise a fatal error, but then it would be difficult to retrieve complex data from the entire network. Only core items everybody has mapped could then easily be retrieved.
So the problem arises if we have RepRegions with only one or few concepts attached to it that might not be mapped by providers. To solve this we could asign to a rep node a list of concepts which are all considered to define the idea behind this rep node or even use an existing rep node of a conceptual schema that defines them like ABCD.
As an extensive list of concepts is not really a possible solution for custom response structures that is not feasable and maybe not even needed. So instead of attaching a list of concepts to the rep node, it would be enough to assign it to the appropiate rep region of a known standard. But as a "primary empty rep region" in a request is an error, each rep region already has at least one concept assigned to it. And this can then also be used to identify a rep region of the conceptual schema. So we would not need any further domain/concept assignments to branch nodes, but rather analyze the rep regions referred to by the normal conceptual binding.
So in the above case:
<verbatim>
A
B*
C*
D -> abcd:concept1
E -> abcd:concept2
</verbatim>
if abcd:concept2 is not mapped by a provider, we take a look at the rep region of that concept in the original standard=conceptual schema:
The excerpt of the underlying conceptual schema showing only the RepRegion for abcd:concept2 with abcd:concept3 being the repeatable element of that region:
<verbatim>
...
abcd:concept3*
abcd:concept4
abcd:concept2
abcd:concept5
...
</verbatim>
We could then check if any of the 4 concepts is being mapped by the provider.
If not, bad luck, we have to raise an error again.
But if not, the algorithm could check if the mapped tables used in this rep region are already containing the table used for the local abcd:concept1 mapping.
If the same table is already used in the rep region of concept2, then we should use B to iterate over the db-resultset.
If our table1 of concept1 is not used in the rep region of concept2, then we should rather use C for looping.
In the PyWrapper code I used to call the rep nodes that are used for looping "active rep nodes".
So analyzing and comparing rep regions from the custom structure with the ones defined in the conceptual schemas could help us decide which rep node to activate if the algorithm faces secondary emtpy rep regions.
For conceptual schemas like the current Darwin Core which does not define any repeatable regions (cardinality) there would be no such mechanism though. But we dont think with darwin core empty repeatable region problems will emerge anyway very often. And hopefully the future Darwin Core standard could at least group some of its concepts...
@
1.22
log
@Revision 22
@
text
@d203 3
@
1.21
log
@Revision 21
@
text
@d1 1
a1 1
see [[#empty][below]] for the empty repeatable region problem
a257 1
@
1.20
log
@Revision 20
@
text
@d204 1
a204 1
So if we face a request containing an EmptyRepRegion in the first place (before evaluating the conceptual bindings and their underlying local mappings) this is clearly an error and can be rejected.
@
1.19
log
@Revision 19
@
text
@d1 1
a1 1
see [[#empty][below]]for the empty repeatable region problem
@
1.18
log
@Revision 18
@
text
@d1 1
@
1.17
log
@Revision 17
@
text
@d120 1
@
1.16
log
@Revision 16
@
text
@d253 1
a253 1
So analyzing and comparing rep regions from the custom structure with the ones defined in the conceptual schemas could help us decide which rep node to activate if we face secondary emtpy rep regions.
@
1.15
log
@Revision 15
@
text
@d201 1
d204 1
d218 1
d220 1
a220 1
So instead of attaching a list of concepts to the branch node, it would be enough to assign it to the appropiate rep region of a known standard. But as a "primary empty rep region" in a request is an error, each rep region already has at least one concept assigned to it. And this can also be used to identify a rep region I think. SO we would not need any further domain/concept assignments to branch nodes, but rather analyze the rep regions referred to by the normal conceptual binding.
d232 1
a232 1
if abcd:concept2 is not mapped by a provider, we take a look at the rep region of that concept in the original standard:
d234 1
a234 1
RepRegion for abcd:concept2 with abcd:concept3 being the repeatable element:
d236 1
d238 1
a238 1
abcd:concept3
d248 1
a248 5
But if not, I see 2 possibilities:
a) use the table of those concepts to create loops:
As they dont show up in the result data, this would create redundant XML and is not what we want.
b) check if the tables used in this rep region are already containing the table used for abcd:concept1.
d252 2
a253 2
I used to call the rep nodes that are used for looping "active rep nodes".
So analyzing rep regions with each other could help us decide which rep node to activate if we face secondary emtpy rep regions.
d255 1
a256 2
There would be no such mechanism for darwin core though, as there are no hierarchical relations defined.
But I dont think with darwin core you will face such problems anyway very often.
@
1.14
log
@Revision 14
@
text
@d196 1
a196 1
In this case B and E form a repeatable region. If B is repeated, E is also repeated.
d198 1
a198 1
And that is the problem. We have no idea about the meaning of B or C and the results returned might make no sense.
d201 1
a201 1
So if we face a request containing an empty rep region in the first place (before evaluating the conceptual binding) this is clearly an error and can be rejected.
d213 2
a214 2
But then one of the provider doesnt have the dwc:concept2 mapped.
What should we do? We cannot raise a fatal error, because then we cannot retrieve any complex data (only core items everybody has)
a216 2
Now you considered to optionally attach a "domain" to branch nodes. I think I didnt really get the idea of a domain. Is it any different from just assigning another close concept to it? But then what happens if this is also not mapped? We would need a list of all concepts that specify the domain. And that actually exists if we analyze the rep regions of structured standards like ABCD.
@
1.13
log
@Revision 13
@
text
@d187 1
a187 1
The main problem is when we face a structure definition, analyze the concept-bindings and local database mappings and then are confronted with an "empty repeatable regions":
@
1.12
log
@Revision 12
@
text
@d167 1
a167 1
If we have a single table with each record representing a specimen, containing only one identification as a single field, the algorithm would interpret the different names of each record as multiple identifications for a single specimen. But remeber that this is only a problem because there is an empty RepeatableRegion "Specimen". If we add for example the catalog number to the structure, we would not face this problem anymore (unless a provider doesnt have a catalog number):
@
1.11
log
@Revision 11
@
text
@d120 1
a120 1
---+++++ Problems
d167 9
a175 1
If we have a single table with specimens data represented records, containing only one identification as a single field, the algorithm would interpret the different names of each record as multiple identifications for a single specimen.
d180 80
a259 2
In the case of the CustomSearch in the protocol, this cannot be defined as we are dealing with different providers with different database structures.
We hope to find a solution in the analysis of a single provider configuration, when also binding a concept to a RepNode without specifying any datatype, thus only giving an indication about how to treat this RepNode.
@
1.10
log
@Revision 10
@
text
@d22 1
a22 1
For a given provider this structure is mapped to database attributes as follows:
d28 3
a30 3
D t1.attrA
E t2.attrB
F t1.attrB
d32 4
a35 4
@@H t3.attrA
@@J t4.attrA
@@K t1.attrC
M t3.attrB
d78 1
a78 1
The tables envolved are as follows:
d121 1
a121 1
The only known problem so far is occuring when there is an empty rep-region above another rep region, that contains mappings.
d129 1
a129 1
D t1.attrA
d132 1
a132 1
In this case the algorithm suggests to use table t1 to loop at RepNode C to creeate multiple D elements:
d144 1
a144 1
But it might be that this a wrong interpretation and actually RepNode B should have been looped over:
d167 1
a167 1
If we have a single table with specimens data represented as 1 record, containing only one identification and no other data, the algorithm would interpret the different names of each record as multiple identifications for a single specimen.
d169 2
a170 2
On provider configuration level in BioCASE it is possible to manually set a "dont repeat" flag for a RepNode, that does not make any sense for the specific providers database structure.
In the example above one could tell the wrapper to not consider "Identification" as a RepNode.
d173 1
a173 2
We hope to find a solution in the analysis of a single providers configuration, when also binding a concept to e RepNode, while not specifying any datatype.
Thus only giving an indication how to treat this RepNode.
@
1.9
log
@Revision 9
@
text
@d40 1
a40 1
Determine for each repeatable element or "RepNode" the "locked" tables to loop over using their (compound) primary keys.
@
1.8
log
@Revision 8
@
text
@d40 1
a40 1
Determine for each repeatable element or "repNode" the "locked" tables to loop over using their (compound) primary keys.
d43 1
a43 1
<br>For each "RepeatableRegion" list all mapped tables that haven't been locked by a higher (parental) repNode.
d45 1
a45 1
A RepeatableRegion are all elements between 2 repNodes including the top repNodes' mapped tables. The root region is the root element and all child elements until the first encountered repNode.
d132 1
a132 1
In this case the algorithm suggests to use table t1 to loop at repNode C to creeate multiple D elements:
d144 1
a144 1
But it might be that this a wrong interpretation and actually repNode B should have been looped over:
d169 2
a170 2
On provider configuration level in BioCASE it is possible to manually set a "dont repeat" flag for a repNode, that does not make any sense for the specific providers database structure.
In the example above one could tell the wrapper to not consider "Identification" as a repNode.
d173 2
a174 2
We hope to find a solution in the analysis of a single providers configuration, when also binding a concept to e repNode, while not specifying any datatype.
Thus only giving an indication how to treat this repNode.
@
1.7
log
@Revision 7
@
text
@d4 1
a4 1
Internally the PyWrapper uses [[http://www.biocase.org/dev/wrapper/CMF.shtml][CMF]] files with a very similar structure as the ResponseStructure definition to represent different ConceptualSchema s.
@
1.6
log
@Revision 6
@
text
@d4 1
a4 1
Internally the PyWrapper uses CMF files with a very similar structure as the ResponseStructure definition to represent different ConceptualSchema s.
@
1.5
log
@Revision 5
@
text
@d119 56
@
1.4
log
@Revision 4
@
text
@d115 4
@
1.3
log
@Revision 3
@
text
@d7 1
a7 1
Imagine an example xml structure like this, with* marking repeatable elements:
d37 78
@
1.2
log
@Revision 2
@
text
@d33 1
a33 1
@@J t4.attrA
d35 1
a35 1
M t3.attrB
@
1.1
log
@Initial revision
@
text
@d28 3
a30 3
D t1.attrA
E t2.attrB
F t1.attrB
a36 1
@