Changeset 153

Timestamp:
05/20/08 21:55:21
Author:
rgrp
Message:

[shksprdata][xl]: move to storing shakespeare texts in subversion (rather than just downloading and extracting).

  • shksprdata/texts: all the gutenberg texts (no moby for time being) + a metadata file listing info (for loading into db on a clean install).
  • shakespeare/model/dm.py, shakespeare/tests/test_model.py: get_store_fileobj method to load the corresponding text from the store
  • trunk/contrib/move_texts.py: helper script to move data from current cache+db setup to files in shksprdata.
  • CHANGELOG.txt: update with stuff for v0.5
  • MANIFEST.in: include material in shksprdata
Files:

  • trunk/CHANGELOG.txt

    Revision 127 Revision 153
      1v0.5: 2008-05-10 
      2================ 
      3 
      4  * Move to Pylons and rework web interface 
      5  * Move command line interface to use pastescript 
      6  * Now have Milton in addition to Shakespeare 
      7  * Store copies of texts in package (shksprdata) rather than downloading. 
      8 
    1v0.4: 2007-04-16 9v0.4: 2007-04-16 
    2================ 10================ 
    3 11 
    4  * Annotation of texts (js-based in browser) (ticket:20, ticket:21) 12  * Annotation of texts (js-based in browser) (ticket:20, ticket:21) 
    5    (<http://www.openshakespeare.org/2007/04/10/annotation-is-working/>) 13    (<http://www.openshakespeare.org/2007/04/10/annotation-is-working/>) 
    6  * Switch to unicode for internal string handling (resolves ticket:23: some 14  * Switch to unicode for internal string handling (resolves ticket:23: some 
    7    texts breaking the viewer) 15    texts breaking the viewer) 
    8  * Add functional tests for the web interface (ticket:11) 16  * Add functional tests for the web interface (ticket:11) 
    9  * Substantial improvements to speed of concordance (ticket:22) 17  * Substantial improvements to speed of concordance (ticket:22) 
    10    (<http://www.openshakespeare.org/2007/01/03/improvements-to-the-concordance/>) 18    (<http://www.openshakespeare.org/2007/01/03/improvements-to-the-concordance/>) 
    11  * Switch to genshi templates from kid 19  * Switch to genshi templates from kid 
    12  * Switch to plain WSGI from cherrypy 20  * Switch to plain WSGI from cherrypy 
    13 21 
    14Outstanding Issues 22Outstanding Issues 
    15------------------ 23------------------ 
    16 24 
    17  * Annotation cannot handle long texts because of javascript performance 25  * Annotation cannot handle long texts because of javascript performance 
    18    issues 26    issues 
    19 27 
    20 28 
    21v0.3: 2006-10-04 29v0.3: 2006-10-04 
    22================ 30================ 
    23 31 
    24  * Can now view mutiple texts side by side (ticket:15). See it in action at: 32  * Can now view mutiple texts side by side (ticket:15). See it in action at: 
    25    <http://demo.openshakespeare.org/view?name=othello_gut_f+othello_gut> 33    <http://demo.openshakespeare.org/view?name=othello_gut_f+othello_gut> 
    26  * Now include moby/bosak versions of shakespeare as well as gutenberg 34  * Now include moby/bosak versions of shakespeare as well as gutenberg 
    27    (ticket:10) (though more work remains to be done to process these versions 35    (ticket:10) (though more work remains to be done to process these versions 
    28    to plaintext and html) 36    to plaintext and html) 
    29  * Fix bug whereby we were missing some of the available gutenberg texts  37  * Fix bug whereby we were missing some of the available gutenberg texts  
    30    (ticket:18) 38    (ticket:18) 
    31  * Install the shakespeare python package (ticket:16) 39  * Install the shakespeare python package (ticket:16) 
    32  * Move to py.test from unittest 40  * Move to py.test from unittest 
    33  * New project website at <http://www.openshakespeare.org/> 41  * New project website at <http://www.openshakespeare.org/> 
    34 42 
    35Outstanding Issues 43Outstanding Issues 
    36------------------ 44------------------ 
    37 45 
    38  * Several of the source texts (all of them Gutenberg folios) seem to  46  * Several of the source texts (all of them Gutenberg folios) seem to  
    39    break the viewer due to kid (the templating system) complaining about about 47    break the viewer due to kid (the templating system) complaining about about 
    40    'not well-formed (invalid token) xml'. Any help in tracking this down would 48    'not well-formed (invalid token) xml'. Any help in tracking this down would 
    41    be greatly appreciated. 49    be greatly appreciated. 
    42 50 
    43 51 
    44v0.2 2006-07-16 52v0.2 2006-07-16 
    45=============== 53=============== 
    46 54 
    47  * Database backend with proper domain model (ticket:6) 55  * Database backend with proper domain model (ticket:6) 
    48  * Text snippets in concordance system and links through to source (ticket:12) 56  * Text snippets in concordance system and links through to source (ticket:12) 
    49  * Sources document (ticket:5) 57  * Sources document (ticket:5) 
  • trunk/MANIFEST.in

    Revision 148 Revision 153
    1recursive-include shakespeare/public * 1recursive-include shakespeare/public * 
    2recursive-include shakespeare/templates * 2recursive-include shakespeare/templates * 
      3recursive-include shksprdata * 
  • trunk/shakespeare/model/dm.py

    Revision 150 Revision 153
    1""" 1""" 
    2Domain model 2Domain model 
    3 3 
    4Material contains all data we have including shakespeare texts. A text is taken 4Material contains all data we have including shakespeare texts. A text is taken 
    5to be a specific version of a work. e.g. the 1623 folio of King Richard III. 5to be a specific version of a work. e.g. the 1623 folio of King Richard III. 
    6 6 
    7We may in future add a Work object to refer to 'abstract' work of which a given 7We may in future add a Work object to refer to 'abstract' work of which a given 
    8text is a version. 8text is a version. 
    9""" 9""" 
    10import sqlobject 10import sqlobject 
    11 11 
    12# make sure config is registered 12# make sure config is registered 
    13import shakespeare 13import shakespeare 
    14shakespeare.conf() 14shakespeare.conf() 
    15 15 
    16from pylons.database import PackageHub 16from pylons.database import PackageHub 
    17hub = PackageHub('shakespeare') 17hub = PackageHub('shakespeare') 
    18sqlobject.sqlhub.processConnection = hub.getConnection() 18sqlobject.sqlhub.processConnection = hub.getConnection() 
    19 19 
    20import shakespeare 20import shakespeare 
    21import shakespeare.cache 21import shakespeare.cache 
    22 22 
    23# import other sqlobject items 23# import other sqlobject items 
    24from annotater.model import Annotation 24from annotater.model import Annotation 
    25import annotater.model 25import annotater.model 
    26 26 
    27# note we run this at bottom of module to auto create db tables on import 27# note we run this at bottom of module to auto create db tables on import 
    28def createdb(): 28def createdb(): 
    29    Material.createTable(ifNotExists=True) 29    Material.createTable(ifNotExists=True) 
    30    Concordance.createTable(ifNotExists=True) 30    Concordance.createTable(ifNotExists=True) 
    31    Statistic.createTable(ifNotExists=True) 31    Statistic.createTable(ifNotExists=True) 
    32    annotater.model.createdb() 32    annotater.model.createdb() 
    33 33 
    34def cleandb(): 34def cleandb(): 
    35    Statistic.dropTable(ifExists=True) 35    Statistic.dropTable(ifExists=True) 
    36    Concordance.dropTable(ifExists=True) 36    Concordance.dropTable(ifExists=True) 
    37    Material.dropTable(ifExists=True) 37    Material.dropTable(ifExists=True) 
    38    annotater.model.cleandb() 38    annotater.model.cleandb() 
    39 39 
    40def rebuilddb(): 40def rebuilddb(): 
    41    cleandb() 41    cleandb() 
    42    createdb() 42    createdb() 
    43 43 
    44class Material(sqlobject.SQLObject): 44class Material(sqlobject.SQLObject): 
    45    """Material related to Shakespeare (usually text of works and ancillary 45    """Material related to Shakespeare (usually text of works and ancillary 
    46    matter such as introductions). 46    matter such as introductions). 
    47 47 
    48    NB: can not use 'text' as class name as it is an sql reserved word 48    NB: can not use 'text' as class name as it is an sql reserved word 
    49 49 
    50    @attribute name: a unique name identifying the material 50    @attribute name: a unique name identifying the material 
    51     51     
    52    TODO: mutiple creators ?? 52    TODO: mutiple creators ?? 
    53    """ 53    """ 
    54     54     
    55    name = sqlobject.StringCol(alternateID=True) 55    name = sqlobject.StringCol(alternateID=True) 
    56    title = sqlobject.StringCol(default=None, length=255) 56    title = sqlobject.StringCol(default=None, length=255) 
    57    # creator rather than author to fit with dublin core 57    # creator rather than author to fit with dublin core 
    58    creator = sqlobject.StringCol(default=None, length=255) 58    creator = sqlobject.StringCol(default=None, length=255) 
    59    url = sqlobject.StringCol(default=None, length=255) 59    url = sqlobject.StringCol(default=None, length=255) 
    60    notes = sqlobject.StringCol(default=None) 60    notes = sqlobject.StringCol(default=None) 
    61 61 
    62    def get_cache_path(self, format): 62    def get_cache_path(self, format): 
    63        """Get path within cache to data file associated with this material. 63        """Get path within cache to data file associated with this material. 
    64        @format: the version ('plain', original='' etc) 64        @format: the version ('plain', original='' etc) 
    65        """ 65        """ 
    66        return shakespeare.cache.default.path(self.url, format) 66        return shakespeare.cache.default.path(self.url, format) 
    67 67 
      68    def get_store_fileobj(self): 
      69        import pkg_resources 
      70        pkg = 'shksprdata' 
      71        # default to plain txt format (TODO: generalise this) 
      72        path = 'texts/%s.txt' % self.name 
      73        fileobj = pkg_resources.resource_stream(pkg, path) 
      74        return fileobj 
      75 
      76 
    68class Concordance(sqlobject.SQLObject): 77class Concordance(sqlobject.SQLObject): 
    69 78 
    70    text = sqlobject.ForeignKey('Material') 79    text = sqlobject.ForeignKey('Material') 
    71    word = sqlobject.StringCol(length=50) 80    word = sqlobject.StringCol(length=50) 
    72    line = sqlobject.IntCol() 81    line = sqlobject.IntCol() 
    73    char_index = sqlobject.IntCol() 82    char_index = sqlobject.IntCol() 
    74 83 
    75    word_index = sqlobject.DatabaseIndex('word') 84    word_index = sqlobject.DatabaseIndex('word') 
    76    text_index = sqlobject.DatabaseIndex('text') 85    text_index = sqlobject.DatabaseIndex('text') 
    77 86 
    78class Statistic(sqlobject.SQLObject): 87class Statistic(sqlobject.SQLObject): 
    79 88 
    80    text = sqlobject.ForeignKey('Material') 89    text = sqlobject.ForeignKey('Material') 
    81    word = sqlobject.StringCol(length=50) 90    word = sqlobject.StringCol(length=50) 
    82    occurrences = sqlobject.IntCol(default=1) 91    occurrences = sqlobject.IntCol(default=1) 
    83 92 
    84    word_index = sqlobject.DatabaseIndex('word') 93    word_index = sqlobject.DatabaseIndex('word') 
    85    text_index = sqlobject.DatabaseIndex('text') 94    text_index = sqlobject.DatabaseIndex('text') 
    86 95 
    87 96 
    88# auto create db tables on import 97# auto create db tables on import 
    89createdb() 98createdb() 
    90 99 
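
    The new get_store_fileobj method reads texts/<name>.txt out of the shksprdata package via pkg_resources.resource_stream. The sketch below illustrates the same lookup using only the standard library (pkgutil.get_data) so that it runs without setuptools; it is not the code in this changeset. The names open_store_text, make_demo_package and demo_shksprdata are hypothetical, and the demo package stands in for shksprdata.

```python
import importlib
import io
import os
import pkgutil
import sys
import tempfile


def open_store_text(name, pkg='shksprdata'):
    """Return a binary file-like object for texts/<name>.txt inside pkg.

    Assumption: texts are stored as plain .txt files under a 'texts'
    directory of the package, as described in the commit message.
    """
    # Resource paths use '/' as separator on every platform.
    data = pkgutil.get_data(pkg, 'texts/%s.txt' % name)
    if data is None:
        # get_data returns None when the package loader cannot serve data.
        raise IOError('cannot load %r from package %r' % (name, pkg))
    return io.BytesIO(data)


def make_demo_package(root, pkg='demo_shksprdata'):
    """Hypothetical helper: build a throwaway package holding one text."""
    texts = os.path.join(root, pkg, 'texts')
    os.makedirs(texts)
    # An empty __init__.py makes the directory an importable package.
    open(os.path.join(root, pkg, '__init__.py'), 'w').close()
    with open(os.path.join(texts, 'sample.txt'), 'wb') as f:
        f.write(b'THE PHOENIX AND THE TURTLE\n')
    sys.path.insert(0, root)
    importlib.invalidate_caches()
    return pkg


if __name__ == '__main__':
    demo_pkg = make_demo_package(tempfile.mkdtemp())
    fileobj = open_store_text('sample', pkg=demo_pkg)
    print(fileobj.read()[:26])
```

    Resolving the file through the package rather than a filesystem path means the lookup does not depend on where the package is installed, which is presumably why the texts are shipped inside shksprdata.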
  • trunk/shakespeare/tests/test_model.py

    Revision 150 Revision 153
    1import sqlobject 1import sqlobject 
    2 2 
    3import shakespeare.model as model 3import shakespeare.model as model 
    4 4 
    5class TestMaterial(object): 5class TestMaterial(object): 
    6 6 
    7    @classmethod 7    @classmethod 
    8    def setup_class(self): 8    def setup_class(self): 
    9        self.name = 'test-123' 9        self.name = 'test-123' 
    10        self.title = 'Hamlet' 10        self.title = 'Hamlet' 
    11        self.url = 'http://www.openshakespeare.org/blah.txt' 11        self.url = 'http://www.openshakespeare.org/blah.txt' 
    12        self.text = model.Material(name=self.name, 12        self.text = model.Material(name=self.name, 
    13                title=self.title, url=self.url) 13                title=self.title, url=self.url) 
    14 14 
    15    @classmethod 15    @classmethod 
    16    def teardown_class(self): 16    def teardown_class(self): 
    17        model.Material.delete(self.text.id) 17        model.Material.delete(self.text.id) 
    18     18     
    19    def test1(self): 19    def test1(self): 
    20        txtid = self.text.id 20        txtid = self.text.id 
    21        txt2 = model.Material.get(txtid) 21        txt2 = model.Material.get(txtid) 
    22        txt3 = model.Material.byName(self.name) 22        txt3 = model.Material.byName(self.name) 
    23        assert self.text.id == txt2.id 23        assert self.text.id == txt2.id 
    24        assert self.text.id == txt3.id 24        assert self.text.id == txt3.id 
    25     25     
    26    def test_get_cache_path(self): 26    def test_get_cache_path(self): 
    27        out = self.text.get_cache_path('plain') 27        out = self.text.get_cache_path('plain') 
    28        # do not want anything too specific or we end up duplicating cache_test 28        # do not want anything too specific or we end up duplicating cache_test 
    29        assert len(out) > 0 29        assert len(out) > 0 
      30 
      31    def test_get_store_fileobj(self): 
      32        text = model.Material.byName('phoenix_and_the_turtle_gut') 
      33        out = text.get_store_fileobj() 
      34        out = out.read() 
      35        assert len(out) > 0 
      36        assert out[:26] == 'THE PHOENIX AND THE TURTLE' 
      37 
    30 38 
    31class TestConcordance(object): 39class TestConcordance(object): 
    32 40 
    33    @classmethod 41    @classmethod 
    34    def setup_class(self): 42    def setup_class(self): 
    35        self.name = 'test-123' 43        self.name = 'test-123' 
    36        self.title = 'Hamlet' 44        self.title = 'Hamlet' 
    37        self.text = model.Material(name=self.name, title=self.title) 45        self.text = model.Material(name=self.name, title=self.title) 
    38        word = 'jones' 46        word = 'jones' 
    39        line = 20 47        line = 20 
    40        char_index = 500 48        char_index = 500 
    41        self.cc1 = model.Concordance(text=self.text, 49        self.cc1 = model.Concordance(text=self.text, 
    42                                         word=word, 50                                         word=word, 
    43                                         line=line, 51                                         line=line, 
    44                                         char_index=char_index) 52                                         char_index=char_index) 
    45 53 
    46    @classmethod 54    @classmethod 
    47    def teardown_class(self): 55    def teardown_class(self): 
    48        model.Concordance.delete(self.cc1.id) 56        model.Concordance.delete(self.cc1.id) 
    49        model.Material.delete(self.text.id) 57        model.Material.delete(self.text.id) 
    50 58 
    51    def test1(self): 59    def test1(self): 
    52        out1 = model.Concordance.get(self.cc1.id) 60        out1 = model.Concordance.get(self.cc1.id) 
    53        assert self.text == out1.text 61        assert self.text == out1.text 
    54 62 
    55class TestStatistic: 63class TestStatistic: 
    56 64 
    57    @classmethod 65    @classmethod 
    58    def setup_class(self): 66    def setup_class(self): 
    59        self.name = 'test-123' 67        self.name = 'test-123' 
    60        self.title = 'Hamlet' 68        self.title = 'Hamlet' 
    61        self.text = model.Material(name=self.name, title=self.title) 69        self.text = model.Material(name=self.name, title=self.title) 
    62        self.word = 'jones' 70        self.word = 'jones' 
    63        self.occurrences = 5 71        self.occurrences = 5 
    64        self.cc1 = model.Statistic( 72        self.cc1 = model.Statistic( 
    65                text=self.text, 73                text=self.text, 
    66                word=self.word, 74                word=self.word, 
    67                occurrences=self.occurrences 75                occurrences=self.occurrences 
    68                ) 76                ) 
    69 77 
    70    @classmethod 78    @classmethod 
    71    def teardown_class(self): 79    def teardown_class(self): 
    72        model.Statistic.delete(self.cc1.id) 80        model.Statistic.delete(self.cc1.id) 
    73        model.Material.delete(self.text.id) 81        model.Material.delete(self.text.id) 
    74 82 
    75    def test1(self): 83    def test1(self): 
    76        out1 = model.Statistic.get(self.cc1.id) 84        out1 = model.Statistic.get(self.cc1.id) 
    77        assert self.text == out1.text 85        assert self.text == out1.text 
    78        assert out1.occurrences == self.occurrences 86        assert out1.occurrences == self.occurrences 
    79 87 
    80    def test_select(self): 88    def test_select(self): 
    81        tresults  = model.Statistic.select( 89        tresults  = model.Statistic.select( 
    82            sqlobject.AND( 90            sqlobject.AND( 
    83                model.Statistic.q.textID == self.text.id, 91                model.Statistic.q.textID == self.text.id, 
    84                model.Statistic.q.word == self.word, 92                model.Statistic.q.word == self.word, 
    85                )) 93                )) 
    86        num = tresults.count() 94        num = tresults.count() 
    87        assert num == 1 95        assert num == 1 
    88 96