root/bbox/DEVELOPMENT

Revision 571, 20.0 kB (checked in by zool, 2 years ago)

wrote a lot of acceptance tests for bbox. moved spatial indexing out of core.
providing a lookup table for feedparser shorthand for RSS namespaces that map to functions.


1 :Mon Dec 13 14:54:09 GMT 2004
2
3 From Matt Webb:
4
5 >here's my scenario, in which the system i'm building interacts with a black
6 >box, X: i ask X, please subscribe to these syndication feeds, please get
7 >anything on del.icio.us and Flickr tagged with "foo" [1]. i wait for a few
8 >days. i then use the bloglines API to pull out weblog entries, and some api or
9 >another to pull out the tagged information, and maybe another to do a search
10 >across the whole datastore for a URL [in the feed text] or keywords. X has gone
11 >away and looked after fetching and storing feeds, fixing rss 0.91, and throwing
12 >errors for 404'd feeds.
13
14 :Tue Dec 14 15:00:23 GMT 2004
15
16 http://www-106.ibm.com/developerworks/xml/library/x-rdfprov.html is edd's article on tracking rss provenance etc with redland contexts. this will be a useful approach, espec for ensuring that feeds are hosted on the same domains they're talking about, for events in the future. we may even be able to subclass edd's aggregator package as-is, then provide simple gateways for other feed formats in and out.
17
18 need to make sure that epistomat either has sensible support for contexts, or that we can provide it in a non-gnarly way. we can probably also augment fraggle with our more pleasant syntax for uris.
19
20 looking at edd's code as it stands, it's very low-level, full of workarounds for things that have since been fixed in the redland python API; a place to start, though... it mentions TODO: recording last-modified and using if-modified-since: we need to get that working with urllib2. http://www.btree.net/python/http_web_services/etags.html runs through this process.
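a rough sketch of the last-modified / etag handshake mentioned above, for reference. it uses python 3's urllib.request (the successor to urllib2); the function name and the example url are made up for illustration, and nothing here is taken from edd's code or from feedparser.

```python
import urllib.request


def build_conditional_request(url, etag=None, last_modified=None):
    """Build a GET request that lets the server answer 304 Not Modified.

    etag and last_modified are the values remembered from the previous
    successful fetch of this feed (both optional).
    """
    request = urllib.request.Request(url)
    if etag:
        # The server compares this against the feed's current ETag.
        request.add_header("If-None-Match", etag)
    if last_modified:
        # An HTTP date string, exactly as the server sent it last time.
        request.add_header("If-Modified-Since", last_modified)
    return request
```

note that urllib.request.urlopen raises HTTPError for a 304 response, so the caller has to catch that and treat it as "nothing new" rather than as a failure.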
21
22 :Wed Dec 15 17:02:44 GMT 2004
23
24 http://sourceforge.net/projects/feedparser/
25
26 is mark pilgrim's last-ditch rss parser thing. i'd be happiest, i suppose, if it did straight transformation of any feed format into rss1. let's see...
27
28 happily it seems to have good handling for last-modified and etag based requests; i only have to receive and send the right headers from the store. it doesn't seem to do transformation, just build data structures from common feed elements and provide a nice interface for accessing properties...
29
30 having a look at the state of the redland store after running edd's demo, it holds a model like this:
31
32 {(r1103038973r1), [http://www.w3.org/1999/02/22-rdf-syntax-ns#_8], [http://sippey.com/archives/000757.php]} {{{[http://usefulinc.com/fraggie/fetch/1]}}}
33 {(r1103038973r1), [http://www.w3.org/1999/02/22-rdf-syntax-ns#_9], [http://www.scottandrew.com/main/2003_07#a000695]} {{{[http://usefulinc.com/fraggie/fetch/1]}}}
34
35 this suggests we should keep an incrementing counter per feed, as well as a counter per fetch of it, to keep these numbers in a serial order? we shouldn't worry about it too much as most of the output to queries will be lists of things constructed in date order. so what is the point of storing the sequentiality of items at all? we could plan for this but not bother in the first iteration, where all we need is a statement item -> partof -> channel.
36
37 :Wed Dec 22 16:01:35 GMT 2004
38
39 i am starting to sketch out code and made a distribution here, which includes the epistomat source and that of mark pilgrim's feedparser. i got distracted by this article of his which was heavily linked to on the foaf wiki; the scutter vocab material there turned out to be not much use.
40
41 This <a href="http://diveintomark.org/archives/2003/07/21/atom_aggregator_behavior_http_level">mark pilgrim article about feed aggregation behaviour</a> looks like a good read, anyway.
42
43 :Tue Jan 11 07:49:03 IST 2005
44
45 eek, it's been a while. ongoing notes:
46
47
48 import httpserver
49
50 bloglines API for retrieval
51
52 feed mgmt - model, collections, collection instances
53
54 learning from past response rate - an urgency parameter which is calculated from the mean time between changes.
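one way the urgency parameter could be worked out, as a sketch only: take the times at which a feed was seen to change, average the gaps, and clamp the result to the 1-24 hour range that bbox:schedule suggests further down. the function names and the clamping are assumptions, not anything already in the code.

```python
from datetime import datetime, timedelta


def mean_change_interval(change_times):
    """Mean gap between successive observed changes, or None if unknown."""
    ordered = sorted(change_times)
    if len(ordered) < 2:
        return None
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(ordered, ordered[1:])]
    return timedelta(seconds=sum(gaps) / len(gaps))


def hours_between_fetches(change_times, minimum=1, maximum=24):
    """Turn the mean change interval into a fetch schedule in whole hours."""
    mean = mean_change_interval(change_times)
    if mean is None:
        # No history yet: fall back to the slowest schedule.
        return maximum
    hours = round(mean.total_seconds() / 3600)
    return max(minimum, min(maximum, hours))
```

a feed that changes every ten minutes still only gets fetched hourly, and one that changes weekly still gets looked at once a day.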
55
56 http://frot.org/2005/bbox/
57
58 bbox:Feed
59         bbox:source
60                 rss:channel
61
62         bbox:last_status
63                 200/403/etc
64         bbox:last_etag
65                 foo010101
66         bbox:last_modified
67                 20059020213
68         bbox:schedule
69                 (hours 1-24 between fetches?)
70
71 bbox:Visit
72         ical:datetime
73                 2005etc
74         bbox:status
75                 200/500/etc
76
77
78 each item is tagged with a visit as context
79         resolving multiples on the way out?
80
81 special rules:
82         if 404 - check 5 previous fetches - if all 404 suspend
83         if 301 - follow, make note     
84         if 302 - follow, change bbox:source
85         if 410, switch off forever
86         - other statuses embedded in feedparser?
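the special rules above, written out as a small decision function. this is a sketch: the action names and the function signature are invented here, and the 301/302 handling follows the notes as written (http itself defines 301 as the permanent redirect, so those two rules may want swapping).

```python
def plan_next_action(status, previous_statuses=(), location=None):
    """Map an HTTP status onto (action, new_source) for a feed.

    previous_statuses holds the statuses of earlier fetches, oldest first.
    location is the redirect target, if the response carried one.
    """
    if status == 410:
        # Gone: switch the feed off forever.
        return ("disable", None)
    if status == 404:
        recent = list(previous_statuses)[-5:]
        if len(recent) == 5 and all(s == 404 for s in recent):
            # This one and the five fetches before it were all 404.
            return ("suspend", None)
        return ("retry", None)
    if status == 301:
        # Follow the redirect and record that it happened.
        return ("follow_and_note", None)
    if status == 302:
        # Follow the redirect and replace bbox:source with the new url.
        return ("follow", location)
    return ("store", None)
```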
87
88 parse gives us a dict oriented model
89         we just use timestamped items and don't use the _1, _2 etc model?
90         as this will confuse us between different sources
91
92         d.etag, d.modified, d.status, d.feed.has_key('foo')
93        
94         there is dc:creator support; we should patch to include foaf:maker, and always use a foaf model for creator details.
95
96 :Tue Feb 22 17:55:33 GMT 2005
97
98 long lag, in which i've spent a couple of hours making things compile and bashing on the epistomat, to the extent that feedreader hooks up, reads different formats, and collapses them into a model which has contexts.
99
100 made a simple http server for the bloglines interface, and now i'm wondering about user accounts. presumably we need them; i had half-envisioned one bbox for one collection of feeds.
101
102 options
103 - make a bbox which doesn't know about user accounts, to test out and use for single-purpose installations (e.g. to crawl spatial info for wirelesslondon, and just have wirelesslondon talk to it)
104 - make a bbox which has user accounts, have a stub or generic one for single-purpose uses. don't worry about user management, but have some kind of HTTP basic auth for transactions.
105
106 case b is probably better, as it won't be much harder to do, will allow us to build in the right functionality straight away, and we can always have an 'all' mode superuser which can't "mark as read" which emulates case a, if that seems necessary.
107
108 user-mode is not for collection of feeds, but it is for 'reading' them NNTP style and also for managing a subscription list, foaf-wise.
109
110 management etc can be done via the HTTP representation. The Sync API doesn't let you add subscriptions through it, so we need to create that.
111
112 we also need to have a new component; a crawler module, that manages getting updates and http status comprehension and timing of future actions; the model in the bbox already handles that stuff, the practicalities of etags etc all supplied by feedparser, which is pretty cool.
113
114 we should probably think pretty seriously about moving to twisted, though; let's look at the docs and compare to a gang of cron jobs / dodgy daemons...
115
116 :Sun Mar  6 16:15:14 GMT 2005
117
118 keep thinking about this again in the context of wirelesslondon / as a grout replacement. does what grout does for WL, with a more specialised and thought-out machine interface. has optional 'spatial extensions', basically, which are stored in PostGIS, often for mapserver's benefit, with references to URIs that are members in a Redland store.
119
120 :Tue Mar 15 17:01:08 GMT 2005
121
122 done a fair bit of work on the underlying 'framework' or what have you. The upgraded rdf-object wrapper is almost debugged and dusted. This has been largely for the benefit of other applications, for wirelesslondon and the consume nodedb.
123
124 in that context i've also been having quite lovely experiences with Quixote, and can now see no reason to build http apps any other way. it can slot into twisted or fastcgi or what have you, easily.
125
126 i made a nice home page for bbox: http://frot.org/bbox/ and hope to get a public svn or cvs repository together just as soon as the tests pass. (tests!)
127
128
129 :Fri Mar 18 23:34:37 GMT 2005
130
131 Flush with "getting things done", i made a simple quixote ui stub for bbox, and started emulating bloglines API functions. I'll stick this stuff in CVS now. Doesn't do much yet, but it's not far off. A simple temporal query is outlined; a spatial bounding box query (with different projections, at least wgs84 and utm zone N...?) should come next.
132
133 In theory redland supports RDQL and similar query languages. The question is mapping the column-table, variable-has-value results you get back from the RDF query into the graph which makes statements that you'd like to complete. This isn't such a big deal short term. it will enable more interesting, foafy sort of things in the future...
134
135 :Tues Mar 22 20:22:00 GMT 2005
136
137 finally we sat down and fixed the rdfobj wrapper layer. i put a copy of it in here, which involved setting PYTHONPATH to include the rdfobj directory.
138
139 So this has facilitated a lot of stuff. Feeds download and are stored in the RDF model, but the clean etag/modified handling advertised by feedparser isn't seamless :/
140
141 i should open up the rdf import too. i wanted to check this in before i broke anything, though.
142
143 :Fri Mar 25 13:08:25 GMT 2005
144
145 I stole wholeheartedly from diveintopython.org a tactful http handler, which i'm using to pick at both rss and rdf feeds. I'm still having niggles serialising the context, but bbox is definitely ready to test now. (needs more tests written, too.)
146
147 The GIS handling which i'd tentatively inserted, i removed; there is a spatialStore object in the wirelesslondon code tree, which would do the job better and more cleanly, opening up to a standalone spatial index abstraction and removing the postgis dependency which is, well, kludgy.
148
149 next is to finish the http interface - bloglines - and figure out how best to do temporal searches; on a per-feed basis we can work around that, for now.
150
151 :Fri Apr 22 02:48:39 BST 2005
152
153 I realise i should have a lot of time and energy to devote to bbox at the moment, and am flailing a little faced with the code, looking at different applications.
154
155 I should do a source release, which would help. i should also add a crawler and collector component to wirelesslondon; to init from openguides and then pick up the recent changes RSS. That would be useful, but wouldn't help with the implications of bbox as a bigger bit of software.
156
157 I've been holding out for interfaces like the ontomatic, because that does potentially really liberate me from the need to hack on cheesy web applications, much if at all.
158
159 Experimenting with drupal and its RSS aggregator enlightened me as to the need for a monitor-feed-index. perhaps just an RSS bot that i could ask for status, for now.
160
161
162 :Fri Apr 22 10:08:19 BST 2005
163
164 a simple way of doing user and feed management, basically. i wanted to allow people to hook in, or at least model their own userdb. we have a lot of this code in wirelesslondon; needs plugged in to a simple deliciouslike API. we may as well bung a few template widgets for HTML into our handler for now, then abstract 'em out into the ontomatic later. o, and started an irc bot to do something like reporting, so i can ponder over monitoring functions. The idea is that the information about the latter should drop out of the model; if adequate info isn't contained in it, something is mildly wrong.
165
166 i just stole all the user account creation code from wl.user and dropped it into bbox and bbox.ui. This is definitely provoking me to wonder if i'm writing the same application. but i need to spike out of stasis at the moment.
167
168 :Sun Oct  9 10:28:49 BST 2005
169
170 Good lord, i've been slack with this process.
171
172 BBox changed a bit while i was writing nodel; now it only stores and queries geometry in wgs84, as it seemed unnecessarily complex to be reprojecting. nodel uses bbox a lot, and there have been many small bugfixes to bbox in the process.
173
174 After i talked to Benoit Gregoire about it, i realised it should store full geometries for all types, there was only stub support for lines and polygons. i am adding that now, supporting a simple RSS serialisation like Mikel's one
175 at http://brainoff.com/worldkit/doc/polygon.php . As spatial queries for bounding boxes were already being done by making a POLYGON and asking for stuff Within() it, this looks simple; the tests already pass; but now i have to go back through, fix the existing interfaces in bbox and get those passing again.
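the bounding-box-as-POLYGON trick mentioned above looks roughly like this. a sketch only: the function name and the lon/lat argument order are assumptions here, not the actual spatialStore interface.

```python
def bbox_to_wkt(min_lon, min_lat, max_lon, max_lat):
    """Return a closed WKT POLYGON ring covering the bounding box."""
    corners = [
        (min_lon, min_lat),
        (max_lon, min_lat),
        (max_lon, max_lat),
        (min_lon, max_lat),
        (min_lon, min_lat),  # WKT rings must end where they started
    ]
    ring = ", ".join("%s %s" % (lon, lat) for lon, lat in corners)
    return "POLYGON((%s))" % ring
```

since a box is just a special case of a polygon, a query for things within an arbitrary shape can go down the same code path once full geometries are stored.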
176
177 Then we need a plan for finding data. I know where there is a lot of data nearby me. in the past, i've collected it mostly using scripts - complete mirrors of the 'open guide to london', that kind of thing. now it really needs to be on an aggregation schedule.
178
179 If we're going to make a nodel UI for bbox then we might as well make a very simple feed-status-manager as well, just a browsable view on bbox.Feed class objects.
180
181 but a lot of aggregation events should actually be described by more codelike rules, and they are handled through nodel's API to different services which is much more sophis. than bbox's model of get feed, look for spatial stuff, remember it all.
182
183 i would say a lot of this for now can be driven by a script on the cron that is exploring the model - get me all tags which an event is tagged with and look at the flickr feed for updates, and so on... get me everything from EVNT from different people's changes and inboxes...
184
185 :Mon Oct 30 20:51:14 GMT 2006
186
187 It's been a long time.
188
189 I'm digging this codebase out because:
190
191 - Jamie King was asking about it
192 - Saul keeps mentioning it in the context of rebooting wirelesslondon
193 - there is a remote possibility that i might get paid for it
194 - mapufacture, bless their cotton socks, have no real incentive to release other than goodwill, they need a structure around them.
195
196 Now i could be contributing my time to egging on mapufacture as i could be to owslib as well. But i'm reminded that bbox is not far off finished. It did work pretty well; it just had bad performance problems on serialisation, trying to haul around too many bulky and interconnected python objects at once.
197
198 When i started thinking about a WFS-basic implementation my first thought is that would belong here. Also if one were thinking about writing a prototype video metadata aggregator - as i assume Jamie is though i thought they were stuck into prototyping right now, and i know some other people are working on a drupal based solution - though this looks more like drawing, socialising and planning for a big sprint in the spring. But by then they (the transmission.cc people) need something that they can be learning from issues with and using to demonstrate proof of value for their contributing participants.
199
200 One issue Jan [sic?] had lamented was the lack of extensibility of the aggregators supplied with drupal. BBox as is, is pretty much the same - it collects a common core of properties well known to feedparser, plus geo:lat and geo:long - feedparser at least is catholic about what it extracts, as long as it's easy to configure what should be learned, this shouldn't be hard to change and will be useful. (i wonder how it handles a lot of atom extensions? - we'll also have to look for updates).
201
202 BBox is totally meant to be light footprint and i see the dependency on nodel crept into it for its http interfaces. This shouldn't have to be the case now - nodel though lovely was an overgrowth - can be replaced with the web.py currently in the geometa codebase.
203
204 How does this one connect to that - both are doing the broker/decorator thing - only the other has a very specific schema. A WFS interface could be appropriate for both though geometa only has envisaged, not implemented support for individual vector features.
205
206 We should a/ work from the data - find a good collection of features that we need to treat of and work from there
207 b/ figure out one directed thing that we can do and finish and that others will see benefit in, whether that is simplifying and extending bbox or extending and rethinking geometa. WFS-basic is super appealing though i am less sure how to implement the equivalent of OWSCat over it. This would be simple and impressive to do. I would not mind restricting this so that the data or at least an index of it has to be in PostGIS. One could index all the shapes in a shapefile as long as one had some way of referring consistently to the originals. But this swiftly starts to get into the domain of annotation system - a problem which looks the same as attaching potentially arbitrary properties and accompanying values to features and collections of them. This is why i keep thinking about bbox, because the arbitrariness is what the rdf store is for. With wfs-basic we don't need to mess around with geoserver and the allocation of URIs any more, and we get the facility of DescribeFeatureType to abuse how we like.
208
209
210 :Sat Nov  4 03:14:17 GMT 2006
211
212 Property extensibility crossed my mind briefly while looking back through __init__.py and i see a long rush of stuff saying "if e.has_key('geo_lat')" etc etc. We definitely have to fix this. This is even worse as it occurs conditionally according to whether or not one has enabled the spatial index. I'm thinking about making the spatial index mandatory.
213
214 Part of this is because there's no date range query support here yet. I really thought there was; i think it dropped out over iterations when spatial query became all important. One can ask for N recent() things but that's based on date collected, not date emitted with the data. We look at the latter and store it in the RDF store in iCal format. Then the object going into the spatial store is decoupled, as above, only if it's enabled. Without it, we can't do date range queries without resorting all the way to SPARQL. Goodness knows rdfobj should have an interface through to SPARQL in redland, which just wasn't stable back when Schuyler and i wrote it. (SPARQL has dateTime-less-than and dateTime-greater-than predicates, but the syntax is messy and right at this minute i don't want to go there - i just want to get the baseline of WFS Simple implemented, no matter how nasty it looks inside for now, and worry later.)
215
216 So basically i am adding a 'dated' datetime field to the index  and that'll have to work just how within_box works now - construct the sql, run it, get a list of node identifiers back.
217 We should be able to pass date range limits into within_box or within_shape. Plus we want to be able to do date range queries without a box, for new. This just goes in the spatialStore.py module for now. Because that's an accreted mess which needs rewritten before any Shiny New Release, anyway.
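a sketch of the "construct the sql, run it, get node identifiers back" step with optional date limits, covering both the with-box and without-box cases. everything named here is an assumption for illustration: the table `spatial_index`, the columns `node_id` and `geom`, and the psycopg-style `%s` placeholders. only the `dated` field comes from the notes above. `ST_Within` and `ST_GeomFromText` are the current PostGIS spellings of what these notes call Within(), and 4326 is the SRID for wgs84.

```python
def build_index_query(shape_wkt=None, start=None, end=None,
                      table="spatial_index"):
    """Return (sql, params) selecting node identifiers from the index.

    shape_wkt: optional WKT geometry the results must fall within.
    start, end: optional inclusive limits on the 'dated' field.
    table: must be a trusted name, since it is interpolated directly.
    """
    clauses = []
    params = []
    if shape_wkt is not None:
        clauses.append("ST_Within(geom, ST_GeomFromText(%s, 4326))")
        params.append(shape_wkt)
    if start is not None:
        clauses.append("dated >= %s")
        params.append(start)
    if end is not None:
        clauses.append("dated <= %s")
        params.append(end)
    sql = "SELECT node_id FROM " + table
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params
```

keeping the values in params rather than pasting them into the string leaves the quoting to the database driver.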
218
219 :Sun Nov  5 07:31:12 GMT 2006
220
221 Right now i want to get to demo fastest, so I hooked this up to wirelesslondon's old database and copied the created dates to dated dates. there are 4.5K things in it which GetFeature by default won't deal very well with.
222
223 :Thu Nov 23 13:42:14 GMT 2006
224
225 WFS Simple was all very nice, and now we are going back and thinking about refactoring.
226
227 1/ bbox/lookup.py contains a pile of lookup functions and a dict mapping them to properties that come out of extended namespaces in feedparser, the idea being that it's easy to extend there what the parser puts into the RDF model.
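for illustration, the lookup idea in 1/ might look something like the following. this is not the contents of bbox/lookup.py: the function names and dict layout are made up, and the entry is treated as a plain dict. `geo_lat` is the feedparser-style key mentioned earlier in these notes and `geo_long` is assumed by analogy; the property uris are from the w3c wgs84 geo vocabulary.

```python
GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"


def to_float(value):
    """Coordinates arrive as strings; store them as numbers."""
    return float(value)


# feedparser-style key -> (property uri to store, function to clean the value)
LOOKUP = {
    "geo_lat": (GEO + "lat", to_float),
    "geo_long": (GEO + "long", to_float),
}


def extract_properties(entry, lookup=LOOKUP):
    """Pick out the extended-namespace properties present in an entry."""
    found = {}
    for key, (prop, convert) in lookup.items():
        if key in entry:
            found[prop] = convert(entry[key])
    return found
```

supporting another namespace then means adding one line to the dict rather than another "if e.has_key(...)" branch.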
228
229 2/ As an artefact of adding doctests to this we went back and fixed rdfobj so it looks for new namespaces to add as module globals after every load(), which is a cheap way of doing it but works and gets rid of a nasty bug we had for a long time (e.g. having to quit the store in the console after a namespace load, or having to run all the tests twice, first time for bootstrap). Which is great.
230
231 3/ But now the callback either wants the spatialStore or we do it another way. Actually it would be great to get the direct spatial stuff out of bbox and provide like a post-hook for objects; so when an object gets past the feedparser or the rdf parser a post-process (like spatial indexing) or even a number of them can get run. So we might want to generalise this. Best way to do it?
232
233 4/ While looking at spatialStore's interface we see it deals in strings and not really in rdf objects (though some newer methods have to do that) and it should probably be the latter on the outside interface - simpler. One could e.g. stick a call to an encoding-sniffer web service in here and insert the metadata one gets back in the object (it doesn't matter if we are waiting, right, there is not often going to be a client sitting around looking at this process)
234
235 I'm happy with this, extensions + some solid refactoring.
236
237 Thanks, Tav, for slapping my python up.
238
239 So:
240 - BBox gets started with a bunch of indexes (remember it had spatial, text etc as separate inputs before)
241 - at the end of read_rdf / read_rss we loop through the indexes calling an add_to_index method which we must provide on the index's interface, pushing each item in at a time. Then *all* the spatially-specific code can move right out of bbox. Yeah!
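a minimal sketch of that plan, assuming nothing about the real classes beyond the add_to_index method name. the class names, the `finish_read` helper and the dict-shaped items are all hypothetical stand-ins.

```python
class RecordingIndex:
    """Stand-in for a spatial or text index; it just remembers items."""

    def __init__(self, name):
        self.name = name
        self.items = []

    def add_to_index(self, item):
        self.items.append(item)


class BBox:
    """Only the part of the store that deals with indexes."""

    def __init__(self, indexes=()):
        self.indexes = list(indexes)

    def finish_read(self, items):
        """What the end of read_rdf / read_rss would do with new items."""
        stored = list(items)
        for item in stored:
            for index in self.indexes:
                # Each index decides for itself what to do with the item.
                index.add_to_index(item)
        return stored
```

bbox no longer needs to know which indexes are spatial; an index that doesn't care about an item can simply ignore it.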
242
243 Ugh, i found a lot of nodel-like authentication code right inside bbox/__init__; it has to leave there.