Changeset 154

Timestamp:
05/21/08 01:56:27 (7 months ago)
Author:
rgrp
Message:

[misc][s]: change to retrieve texts from (shksprdata) package rather than from cache and create db init command to initialise database from metadata.txt.

  • model/dm.py:
    • rename get_store_fileobj to get_text and use everywhere instead of get_cache_path
    • load_from_metadata method to load Material entries from metadata.
  • cli.py (+ README.txt): db init etc.
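
As a rough illustration of reading a data file from an installed package rather than from a cache directory, here is a minimal sketch. It uses the standard library's importlib.resources (Python 3.9+) instead of the pkg_resources call in this changeset, and it builds a throwaway package so it runs without shksprdata. The package name demo_textdata, the helper open_package_resource and the file contents are all hypothetical; the real format of metadata.txt is not shown here.

```python
import importlib
import importlib.resources
import pathlib
import sys
import tempfile

# Build a throwaway package on disk so the sketch runs without shksprdata.
# 'demo_textdata' and the file contents are hypothetical placeholders.
root = pathlib.Path(tempfile.mkdtemp())
pkg_dir = root / "demo_textdata"
(pkg_dir / "texts").mkdir(parents=True)
(pkg_dir / "__init__.py").write_text("")
(pkg_dir / "texts" / "metadata.txt").write_bytes(b"placeholder metadata\n")
sys.path.insert(0, str(root))
importlib.invalidate_caches()


def open_package_resource(package, relative_path):
    """Open a data file shipped inside an importable package, in binary mode."""
    resource = importlib.resources.files(package)
    for part in relative_path.split("/"):
        resource = resource / part
    return resource.open("rb")


# Read the resource through the package, not through a filesystem cache path.
with open_package_resource("demo_textdata", "texts/metadata.txt") as meta:
    data = meta.read()
```

The point of the design is that callers name a package and a relative resource path, so the data travels with the installed package and no separate cache directory has to be created first.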
Files:

  • trunk/README.txt

    Revision 150 Revision 154
    1Introduction 1Introduction 
    2************ 2************ 
    3 3 
    4The Open Shakespeare package provides a full open set of shakespeare's works 4The Open Shakespeare package provides a full open set of shakespeare's works 
    5(often in multiple versions) along with ancillary material, a variety of tools 5(often in multiple versions) along with ancillary material, a variety of tools 
    6and a python API. 6and a python API. 
    7 7 
    8Specifically in addition to the works themselves (often in multiple versions) 8Specifically in addition to the works themselves (often in multiple versions) 
    9there is an introduction, a chronology, explanatory notes, a concordance and 9there is an introduction, a chronology, explanatory notes, a concordance and 
    10search facilities. 10search facilities. 
    11 11 
    12All material is open source/open knowledge so that anyone can use, redistribute 12All material is open source/open knowledge so that anyone can use, redistribute 
    13and reuse these materials freely. For exact details of the license under which 13and reuse these materials freely. For exact details of the license under which 
    14this package is made available please see COPYING.txt. 14this package is made available please see COPYING.txt. 
    15 15 
    16Open Shakespeare has been developed under the aegis of the Open Knowledge 16Open Shakespeare has been developed under the aegis of the Open Knowledge 
    17Foundation (http://www.okfn.org/). 17Foundation (http://www.okfn.org/). 
    18 18 
    19Contact the Project 19Contact the Project 
    20******************* 20******************* 
    21 21 
    22Please mail info@okfn.org or join the okfn-discuss mailing list: 22Please mail info@okfn.org or join the okfn-discuss mailing list: 
    23 23 
    24  http://lists.okfn.org/listinfo/okfn-discuss 24  http://lists.okfn.org/listinfo/okfn-discuss 
    25 25 
    26 26 
    27Installation and Setup 27Installation and Setup 
    28********************** 28********************** 
    29 29 
    301. Install the code 301. Install the code 
    31=================== 31=================== 
    32 32 
    331.1: (EITHER) Install using setup.py (preferred) 331.1: (EITHER) Install using setup.py (preferred) 
    34------------------------------------------------ 34------------------------------------------------ 
    35 35 
    36Install ``shakespeare`` using easy_install:: 36Install ``shakespeare`` using easy_install:: 
    37 37 
    38    easy_install shakespeare 38    easy_install shakespeare 
    39 39 
    40NB: If you don't have easy_install you can get from here: 40NB: If you don't have easy_install you can get from here: 
    41 41 
    42<http://peak.telecommunity.com/DevCenter/EasyInstall#installation-instructions> 42<http://peak.telecommunity.com/DevCenter/EasyInstall#installation-instructions> 
    43 43 
    44Make a config file as follows:: 44Make a config file as follows:: 
    45 45 
    46    paster make-config shakespeare config.ini 46    paster make-config shakespeare config.ini 
    47 47 
    48Tweak the config file as appropriate and then setup the application:: 48Tweak the config file as appropriate and then setup the application:: 
    49 49 
    50    paster setup-app config.ini 50    paster setup-app config.ini 
    51 51 
    521.2 (OR) Get the code straight from subversion 521.2 (OR) Get the code straight from subversion 
    53------------------------------------------------ 53------------------------------------------------ 
    54 54 
    551. Check out the subversion trunk:: 551. Check out the subversion trunk:: 
    56 56 
    57    svn co https://knowledgeforge.net/shakespeare/svn/trunk 57    svn co https://knowledgeforge.net/shakespeare/svn/trunk 
    58 58 
    592. Do:: 592. Do:: 
    60 60 
    61    sudo python setup.py develop 61    sudo python setup.py develop 
    62 62 
    63 63 
    642. Cache Directory 642. Cache Directory 
    65================== 65================== 
    66 66 
    67Create a cache directory where texts and other material can be stored 67Create a cache directory where texts and other material can be stored 
    68 68 
    69This directory needs to be semi-permanent so do *not* put under a location such 69This directory needs to be semi-permanent so do *not* put under a location such 
    70as /tmp.  70as /tmp.  
    71 71 
    72 72 
    73 73 
    745. Initialize the system 745. Initialize the system 
    75======================== 75======================== 
    76 76 
    77Run: $ bin/shakespeare-admin init 77Run:: 
    78 78 
    79This may take some time to run so be patient  79     $ shakespeare-admin db create 
       80     $ shakespeare-admin db init 
    80 81 
    81TIP: using sqlite building the concordance really **does** seem to run forever  82 If you want to build the concordance do:: 
    82so recommend using postgresql or mysql if you are going to build the  83  
    83concordance.   84     $ shakespeare-admin concordance 
       85  
       86 NB: This may take some time to run so be patient. TIP: using sqlite building 
       87 the concordance really **does** seem to run forever so recommend using 
       88 postgresql or mysql if you are going to build the concordance.  
    84 89 
    85 90 
    86Getting Started 91Getting Started 
    87*************** 92*************** 
    88 93 
    89As a user: 94As a user: 
    90========== 95========== 
    91 96 
    92Start up the web interface by running the webserver: 97Start up the web interface by running the webserver:: 
    93 98 
    94  $ bin/shakespeare-admin runserver 99    $ paster serve {your-config.ini} 
    95 100 
    96Then visit http://localhost:8080/ using your favourite web browser.  101 NB: {your-config.ini} should be replaced with the name of the config file you 
       102 created earlier. 
       103  
    97 104 
    98As a developer: 105As a developer: 
    99=============== 106=============== 
    100 107 
    1010. Copy development.ini.tmpl to development.ini and edit to your taste. 1080. Copy development.ini.tmpl to development.ini and edit to your taste. 
    102 109 
    1031. Check out the administrative commands: $ bin/shakespeare-admin help. 1101. Check out the administrative commands: $ bin/shakespeare-admin help. 
    104 111 
    1052. Run the tests using either py.test of nosetests:: 1122. Run the tests using either py.test of nosetests:: 
    106 113 
    107    $ nosetests shakespeare 114    $ nosetests shakespeare 
    108 115 
  • trunk/contrib/size.py

    Revision 61 Revision 154
    1#!/usr/bin/env python 1#!/usr/bin/env python 
    2""" 2""" 
    3Print shakespeare plays and their sizes. 3Print shakespeare plays and their sizes. 
    4 4 
    5Use Gutenberg plain versions 5Use Gutenberg plain versions 
    6""" 6""" 
    7import shakespeare.index 7import shakespeare.index 
    8 8 
    9def count_words(fileobj): 9def count_words(fileobj): 
    10    """Count the number of words in a file.""" 10    """Count the number of words in a file.""" 
    11    count = 0  11    count = 0  
    12    for line in fileobj: 12    for line in fileobj: 
    13        words = line.split() 13        words = line.split() 
    14        count += len(words) 14        count += len(words) 
    15    return count 15    return count 
    16 16 
    17numtexts = 0 17numtexts = 0 
    18totalwords = 0 18totalwords = 0 
    19for text in shakespeare.index.all: 19for text in shakespeare.index.all: 
    20    # if you wanted the title it would be text.title 20    # if you wanted the title it would be text.title 
    21    name = text.name 21    name = text.name 
    22    # want gutenberg version but not folios 22    # want gutenberg version but not folios 
    23    # if you want to include folios remove the second condition 23    # if you want to include folios remove the second condition 
    24    if '_gut' in name and not '_gut_f' in name: 24    if '_gut' in name and not '_gut_f' in name: 
    25        numtexts += 1 25        numtexts += 1 
    26        fileobj = file(text.get_cache_path('plain')) 26        fileobj = file(text.get_text()) 
    27        numwords = count_words(fileobj) 27        numwords = count_words(fileobj) 
    28        print name.ljust(60), numwords 28        print name.ljust(60), numwords 
    29        totalwords += numwords 29        totalwords += numwords 
    30print '-------------------------' 30print '-------------------------' 
    31print 'Total: %s works, %s words' % (numtexts, totalwords) 31print 'Total: %s works, %s words' % (numtexts, totalwords) 
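
The commit message says get_text is the renamed get_store_fileobj, which suggests it may return a file-like object rather than a path; dm.py is not shown here, so that is an assumption. If it holds, count_words can take the object directly without the file(...) wrapper. A minimal Python 3 sketch of that word count, with io.StringIO and made-up sample text standing in for whatever get_text returns:

```python
import io


def count_words(fileobj):
    """Count whitespace-separated words in a file-like object."""
    return sum(len(line.split()) for line in fileobj)


# Stand-in for the object returned by get_text(); the sample text is made up.
sample = io.StringIO("To be, or not to be,\nthat is the question.\n")
total = count_words(sample)
print(total)
```

Because count_words only iterates over lines, it works the same on an open file, a package resource stream decoded to text, or an in-memory buffer.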
  • trunk/contrib/view_raw.py

    Revision 91 Revision 154
    1#!/usr/bin/env python 1#!/usr/bin/env python 
    2import sys 2import sys 
    3 3 
    4import shakespeare.dm 4import shakespeare.dm 
    5 5 
    6name = sys.argv[1] 6name = sys.argv[1] 
    7work = shakespeare.dm.Material.byName(name) 7work = shakespeare.dm.Material.byName(name) 
    8path = work.get_cache_path('plain') 8path = work.get_text() 
    9ff = file(path) 9ff = file(path) 
    10print path 10print path 
    11indata = unicode(ff.read(), 'utf-8') 11indata = unicode(ff.read(), 'utf-8') 
    12print indata.encode('utf-8') 12print indata.encode('utf-8') 
  • trunk/shakespeare/cli.py

    Revision 151 Revision 154
    1#!/usr/bin/env python 1#!/usr/bin/env python 
    2 2 
    3import cmd 3import cmd 
    4import os 4import os 
    5import StringIO 5import StringIO 
    6 6 
    7class ShakespeareAdmin(cmd.Cmd): 7class ShakespeareAdmin(cmd.Cmd): 
    8    """ 8    """ 
    9    TODO: self.verbose option and associated self._print 9    TODO: self.verbose option and associated self._print 
    10    """ 10    """ 
    11 11 
    12    prompt = 'The Bard > ' 12    prompt = 'The Bard > ' 
    13 13 
    14    def run_interactive(self, line=None): 14    def run_interactive(self, line=None): 
    15        """Run an interactive session. 15        """Run an interactive session. 
    16        """ 16        """ 
    17        print 'Welcome to shakespeare-admin interactive mode\n' 17        print 'Welcome to shakespeare-admin interactive mode\n' 
    18        self.do_about() 18        self.do_about() 
    19        print 'Type:  "?" or "help" for help on commands.\n' 19        print 'Type:  "?" or "help" for help on commands.\n' 
    20        while 1: 20        while 1: 
    21            try: 21            try: 
    22                self.cmdloop() 22                self.cmdloop() 
    23                break 23                break 
    24            except KeyboardInterrupt: 24            except KeyboardInterrupt: 
    25                raise 25                raise 
    26 26 
    27    def do_help(self, line=None): 27    def do_help(self, line=None): 
    28        cmd.Cmd.do_help(self, line) 28        cmd.Cmd.do_help(self, line) 
    29 29 
    30    def do_about(self, line=None): 30    def do_about(self, line=None): 
    31        import shakespeare 31        import shakespeare 
    32        version = shakespeare.__version__ 32        version = shakespeare.__version__ 
    33        about = \ 33        about = \ 
    34'''Open Shakespeare version %s. Copyright the Open Knowledge Foundation. 34'''Open Shakespeare version %s. Copyright the Open Knowledge Foundation. 
    35Open Shakespeare is open-knowledge and open-source. See COPYING for details. 35Open Shakespeare is open-knowledge and open-source. See COPYING for details. 
    36''' % version 36''' % version 
    37        print about 37        print about 
    38 38 
    39    def do_quit(self, line=None): 39    def do_quit(self, line=None): 
    40        sys.exit() 40        sys.exit() 
    41 41 
    42    def do_EOF(self, *args): 42    def do_EOF(self, *args): 
    43        print '' 43        print '' 
    44        sys.exit() 44        sys.exit() 
    45 45 
    46    # ================= 46    # ================= 
    47    # Commands 47    # Commands 
    48 48 
    49    def do_db(self, line=None): 49    def do_db(self, line=None): 
    50        actions = [ 'create', 'clean', 'rebuild' ] 50        actions = [ 'create', 'clean', 'rebuild', 'init' ] 
    51        if line is None or line not in actions: 51        if line is None or line not in actions: 
    52            self.help_db() 52            self.help_db() 
    53            return 1 53            return 1 
    54        import shakespeare.dm  54         import shakespeare.model 
    55        shakespeare.dm.__dict__[line+'db']()  55         if line == 'init': 
       56             import pkg_resources 
       57             pkg = 'shksprdata' 
       58             meta = pkg_resources.resource_stream(pkg, 'texts/metadata.txt') 
       59             shakespeare.model.Material.load_from_metadata(meta) 
       60         else: 
       61             shakespeare.model.__dict__[line+'db']() 
    56 62 
    57    def help_db(self, line=None): 63    def help_db(self, line=None): 
    58        usage = \ 64        usage = \ 
    59'''db <action> 65'''db { create | clean | rebuild | init } 
    60 66''' 
    61Where action is one of create, clean, rebuild.'''   
    62        print usage 67        print usage 
    63     68     
    64    def do_gutenberg(self, line=None): 69    def do_gutenberg(self, line=None): 
    65        import shakespeare.gutenberg 70        import shakespeare.gutenberg 
    66        helper = shakespeare.gutenberg.Helper(verbose=True) 71        helper = shakespeare.gutenberg.Helper(verbose=True) 
    67        if not line: 72        if not line: 
    68            helper.execute() 73            helper.execute() 
    69        elif line == 'print_index': 74        elif line == 'print_index': 
    70            import pprint 75            import pprint 
    71            pprint.pprint(helper.get_index()) 76            pprint.pprint(helper.get_index()) 
    72        else: 77        else: 
    73            msg = 'Unknown argument %s' % line 78            msg = 'Unknown argument %s' % line 
    74            raise Exception(msg) 79            raise Exception(msg) 
    75 80 
    76    def help_gutenberg(self, line=None): 81    def help_gutenberg(self, line=None): 
    77        usage = \ 82        usage = \ 
    78""" 83""" 
    79Download and process all Project Gutenberg shakespeare texts""" 84Download and process all Project Gutenberg shakespeare texts""" 
    80        print usage  85        print usage  
    81 86 
    82    def do_moby(self, line=None): 87    def do_moby(self, line=None): 
    83        import shakespeare.moby 88        import shakespeare.moby 
    84        helper = shakespeare.moby.Helper(verbose=True) 89        helper = shakespeare.moby.Helper(verbose=True) 
    85        if not line: 90        if not line: 
    86            helper.execute() 91            helper.execute() 
    87        elif line == 'print_index': 92        elif line == 'print_index': 
    88            import pprint 93            import pprint 
    89            pprint.pprint(helper.get_index()) 94            pprint.pprint(helper.get_index()) 
    90        else: 95        else: 
    91            msg = 'Unknown argument %s' % line 96            msg = 'Unknown argument %s' % line 
    92            raise Exception(msg) 97            raise Exception(msg) 
    93 98 
    94    def help_moby(self, line=None): 99    def help_moby(self, line=None): 
    95        usage = \ 100        usage = \ 
    96''' 101''' 
    97Download and process all Moby/Bosak shakespeare texts''' 102Download and process all Moby/Bosak shakespeare texts''' 
    98        print usage  103        print usage  
    99 104 
    100    def _init_index(self): 105    def _init_index(self): 
    101        import shakespeare.index 106        import shakespeare.index 
    102        self._index = shakespeare.index.all 107        self._index = shakespeare.index.all 
    103 108 
    104    def _filter_index(self, line): 109    def _filter_index(self, line): 
    105        """Filter items in index return only those whose id (url) is in line 110        """Filter items in index return only those whose id (url) is in line 
    106        If line is empty or None return all items 111        If line is empty or None return all items 
    107        """ 112        """ 
    108        if line: 113        if line: 
    109            textsToAdd = [] 114            textsToAdd = [] 
    110            textNames = line.split() 115            textNames = line.split() 
    111            for item in self._index: 116            for item in self._index: 
    112                if item.name in textNames: 117                if item.name in textNames: 
    113                    textsToAdd.append(item) 118                    textsToAdd.append(item) 
    114            return textsToAdd 119            return textsToAdd 
    115        else: 120        else: 
    116            self._init_index() 121            self._init_index() 
    117            return self._index 122            return self._index 
    118     123     
    119    def do_print_index(self, line): 124    def do_index(self, line): 
    120        self._init_index() 125        self._init_index() 
    121        header = \ 126        header = \ 
    122'''          +-------------------+ 127'''          +-------------------+ 
    123          | Index of Material | 128          | Index of Material | 
    124          +-------------------+ 129          +-------------------+ 
    125 130 
    126''' 131''' 
    127        print header 132        print header 
    128        for row in self._index: 133        for row in self._index: 
    129            print row.name.ljust(35), row.title 134            print row.name.ljust(35), row.title 
    130 135 
    131    def help_print_index(self, line=None): 136    def help_index(self, line=None): 
    132        usage = \ 137        usage = \ 
    133'''Print index of Shakespeare texts to stdout''' 138'''Print index of Shakespeare texts to stdout''' 
    134        print usage 139        print usage 
    135 140 
    136    def do_make_concordance(self, line=None): 141    def do_concordance(self, line=None): 
    137        self._init_index() 142        self._init_index() 
    138        print 'Making concordance (this may take some time ...):' 143        print 'Making concordance (this may take some time ...):' 
    139        from shakespeare.concordance import ConcordanceBuilder 144        from shakespeare.concordance import ConcordanceBuilder 
    140        import time 145        import time 
    141        start = end = 0 146        start = end = 0 
    142        start = time.time() 147        start = time.time() 
    143        cc = ConcordanceBuilder() 148        cc = ConcordanceBuilder() 
    144        textsToAdd = [] 149        textsToAdd = [] 
    145        if line is not None: 150        if line is not None: 
    146            textsToAdd = self._filter_index(line) 151            textsToAdd = self._filter_index(line) 
    147        else: 152        else: 
    148            def gut_non_folio(material): 153            def gut_non_folio(material): 
    149                return '_gut' in material.name and 'gut_f' not in material.name 154                return '_gut' in material.name and 'gut_f' not in material.name 
    150            textsToAdd = filter(gut_non_folio, self._index)  155            textsToAdd = filter(gut_non_folio, self._index)  
    151        for item in textsToAdd: 156        for item in textsToAdd: 
    152            print 'Adding: %s (%s)' % (item.name, item.title) 157            print 'Adding: %s (%s)' % (item.name, item.title) 
    153            cc.add_text(item.name) 158            cc.add_text(item.name) 
    154        end = time.time() 159        end = time.time() 
    155        timetaken = end - start 160        timetaken = end - start 
    156        print 'Finished. Time taken was %ss' % timetaken 161        print 'Finished. Time taken was %ss' % timetaken 
    157 162 
    158    def help_make_concordance(self, line=None): 163    def help_concordance(self, line=None): 
    159        usage = \ 164        usage = \ 
    160'''Create a concordance 165'''Create a concordance 
    161 166 
    162If no arguments supplied then use all non-folio gutenberg shakespeare texts. 167If no arguments supplied then use all non-folio gutenberg shakespeare texts. 
    163Otherwise arguments should be a space seperated list of work name ids 168Otherwise arguments should be a space seperated list of work name ids 
    164''' 169''' 
    165        print usage   
    166   
    167    def do_init(self, line=None):   
    168        self.do_gutenberg(line)   
    169        self.do_moby(line)   
    170        self.do_make_concordance(line)   
    171   
    172    def help_init(self, line=None):   
    173        usage = \   
    174'''Convenience function that sets everything up by running:   
    175    1. gutenberg   
    176    2. moby   
    177    3. make_concordance'''   
    178        print usage 170        print usage 
    179 171 
    180    def do_runserver(self, line=None): 172    def do_runserver(self, line=None): 
    181        self.help_runserver() 173        self.help_runserver() 
    182 174 
    183    def help_runserver(self, line=None): 175    def help_runserver(self, line=None): 
    184        usage = \ 176        usage = \ 
    185'''This command has been DEPRECATED. 177'''This command has been DEPRECATED. 
    186 178 
    187Please use `paster serve` to run a server now, e.g.:: 179Please use `paster serve` to run a server now, e.g.:: 
    188 180 
    189    paster serve <my-config.ini> 181    paster serve <my-config.ini> 
    190''' 182''' 
    191        print usage 183        print usage 
    192 184 
    193 185 
    194def main(): 186def main(): 
    195    import optparse 187    import optparse 
    196    usage = \ 188    usage = \ 
    197'''%prog [options] <command> 189'''%prog [options] <command> 
    198 190 
    199Run about or help for details.''' 191Run about or help for details.''' 
    200    parser = optparse.OptionParser(usage) 192    parser = optparse.OptionParser(usage) 
    201    parser.add_option('-v', '--verbose', dest='verbose', help='Be verbose', 193    parser.add_option('-v', '--verbose', dest='verbose', help='Be verbose', 
    202            action='store_true', default=False)  194            action='store_true', default=False)  
    203    options, args = parser.parse_args() 195    options, args = parser.parse_args() 
    204     196     
    205    if len(args) == 0: 197    if len(args) == 0: 
    206        parser.print_help() 198        parser.print_help() 
    207        return 1 199        return 1 
    208    else: 200    else: 
    209        cmd = ShakespeareAdmin() 201        cmd = ShakespeareAdmin() 
    210        args = ' '.join(args) 202        args = ' '.join(args) 
    211        args = args.replace('-','_') 203        args = args.replace('-','_') 
    212        cmd.onecmd(args) 204        cmd.onecmd(args) 
    213 205 
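
The new do_db special-cases 'init' and otherwise looks up '<action>db' in the module's __dict__. One alternative, sketched below in Python 3 under stated assumptions, is an explicit table mapping each action to a handler. DemoAdmin and its recording handlers are hypothetical stand-ins and do not touch a database; only cmd.Cmd and its onecmd method come from the standard library.

```python
import cmd


class DemoAdmin(cmd.Cmd):
    """Hypothetical stand-in for ShakespeareAdmin; handlers only record calls."""

    def __init__(self):
        super().__init__()
        self.calls = []
        # Explicit action table instead of a '<action>db' lookup in a module __dict__.
        self.db_actions = {
            "create": lambda: self.calls.append("create"),
            "clean": lambda: self.calls.append("clean"),
            "rebuild": lambda: self.calls.append("rebuild"),
            "init": lambda: self.calls.append("init"),
        }

    def do_db(self, line=None):
        action = (line or "").strip()
        handler = self.db_actions.get(action)
        if handler is None:
            # Unknown or missing action: show usage and do nothing else.
            self.help_db()
            return
        handler()

    def help_db(self):
        print("db { create | clean | rebuild | init }")


admin = DemoAdmin()
admin.onecmd("db init")    # dispatches to the 'init' handler
admin.onecmd("db create")  # dispatches to the 'create' handler
```

A true return value from a do_ method ends cmd.Cmd.cmdloop, so this sketch returns None for an unknown action and an interactive session would keep running.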
  • trunk/shakespeare/concordance.py

    Revision 150 Revision 154
    1""" 1""" 
    2Concordance (and statistics) for texts in database. 2Concordance (and statistics) for texts in database. 
    3 3 
    4To build concordance use ConcordanceBuilder.  To access concordance/statistics 4To build concordance use ConcordanceBuilder.  To access concordance/statistics 
    5use Concordance/Statistics class.  Concordance and statistics are provided as 5use Concordance/Statistics class.  Concordance and statistics are provided as 
    6dictionaries keyed by words. 6dictionaries keyed by words. 
    7 7 
    8NB: all word keys have been lower-cased in order to render them 8NB: all word keys have been lower-cased in order to render them 
    9case-insensitive 9case-insensitive 
    10""" 10""" 
    11import re 11import re 
    12 12 
    13import sqlobject 13import sqlobject 
    14 14 
    15import shakespeare.index 15import shakespeare.index 
    16import shakespeare.cache 16import shakespeare.cache 
    17 17 
    18 18 
    19class ConcordanceBase(object): 19class ConcordanceBase(object): 
    20    """ 20    """ 
    21    TODO: caching?? 21    TODO: caching?? 
    22    """ 22    """ 
    23    sqlcc = shakespeare.model.Concordance 23    sqlcc = shakespeare.model.Concordance 
    24    sqlstat = shakespeare.model.Statistic 24    sqlstat = shakespeare.model.Statistic 
    25 25 
    26    def __init__(self, filter_names=None): 26    def __init__(self, filter_names=None): 
    27        """ 27        """ 
    28        @param filter_names: a list of id names with which to filter results 28        @param filter_names: a list of id names with which to filter results 
    29            (i.e. only return results relating to those texts) 29            (i.e. only return results relating to those texts) 
    30        """ 30        """ 
    31        self._filter_names = filter_names 31        self._filter_names = filter_names 
    32        self.sqlcc_filter = self._make_filter(self.sqlcc) 32        self.sqlcc_filter = self._make_filter(self.sqlcc) 
    33        self.sqlstat_filter = self._make_filter(self.sqlstat) 33        self.sqlstat_filter = self._make_filter(self.sqlstat) 
    34 34 
    35    def _make_filter(self, sqlobj): 35    def _make_filter(self, sqlobj): 
    36        sql_filter = True 36        sql_filter = True 
    37        if self._filter_names is not None: 37        if self._filter_names is not None: 
    38            arglist = [] 38            arglist = [] 
    39            for name in self._filter_names: 39            for name in self._filter_names: 
    40                newarg = sqlobj.q.textID == self._name2id(name) 40                newarg = sqlobj.q.textID == self._name2id(name) 
    41                arglist.append(newarg) 41                arglist.append(newarg) 
    42            sql_filter = sqlobject.OR(*arglist) 42            sql_filter = sqlobject.OR(*arglist) 
    43        return sql_filter 43        return sql_filter 
    44     44     
    45    def _name2id(self, name): 45    def _name2id(self, name): 
    46        return shakespeare.model.Material.byName(name).id 46        return shakespeare.model.Material.byName(name).id 
    47 47 
    48    def keys(self): 48    def keys(self): 
    49        """Return list of *distinct* words in concordance/statistics 49        """Return list of *distinct* words in concordance/statistics 
    50        """ 50        """ 
    51        all = self.sqlstat.select(self.sqlstat_filter, 51        all = self.sqlstat.select(self.sqlstat_filter, 
    52                           orderBy=self.sqlstat.q.word, 52                           orderBy=self.sqlstat.q.word, 
    53                           ) 53                           ) 
    54        words = [ xx.word for xx in list(all) ] 54        words = [ xx.word for xx in list(all) ] 
    55        distinct = list(set(words)) 55        distinct = list(set(words)) 
    56        distinct.sort() 56        distinct.sort() 
    57        return distinct 57        return distinct 
    58 58 
    59 59 
    60class Concordance(ConcordanceBase): 60class Concordance(ConcordanceBase): 
    61    """Concordance by word for a set of texts 61    """Concordance by word for a set of texts 
    62    """ 62    """ 
    63 63 
    64    def get(self, word): 64    def get(self, word): 
    65        """Get list of occurrences for word 65        """Get list of occurrences for word 
    66        @return: sqlobject query list  66        @return: sqlobject query list  
    67        """ 67        """ 
    68        select = self.sqlcc.select(sqlobject.AND(self.sqlcc_filter, self.sqlcc.q.word==word)) 68        select = self.sqlcc.select(sqlobject.AND(self.sqlcc_filter, self.sqlcc.q.word==word)) 
    69        return select 69        return select 
    70 70 
    71class Statistics(ConcordanceBase): 71class Statistics(ConcordanceBase): 
    72 72 
    73    def get(self, word): 73    def get(self, word): 
    74        select = self.sqlstat.select( 74        select = self.sqlstat.select( 
    75            sqlobject.AND(self.sqlstat_filter, self.sqlstat.q.word==word) 75            sqlobject.AND(self.sqlstat_filter, self.sqlstat.q.word==word) 
    76            ) 76            ) 
    77        total = 0 77        total = 0 
    78        for stat in select: 78        for stat in select: 
    79            total += stat.occurrences 79            total += stat.occurrences 
    80        return total 80        return total 
    81 81 
    82class ConcordanceBuilder(object): 82class ConcordanceBuilder(object): 
    83    """Build a concordance and associated statistics for a set of texts. 83    """Build a concordance and associated statistics for a set of texts. 
    84     84     
    85    """ 85    """ 
    86 86 
    87    # multiline, unicode and ignorecase 87    # multiline, unicode and ignorecase 
    88    word_regex = re.compile(r'\b(\w+)\b', re.U | re.M | re.I) 88    word_regex = re.compile(r'\b(\w+)\b', re.U | re.M | re.I) 
    89 89 
    90    words_to_ignore = [  90    words_to_ignore = [  
    91        # 'a', 'the', 'and', 'as', 'are', 'be', 'but', 'in' 91        # 'a', 'the', 'and', 'as', 'are', 'be', 'but', 'in' 
    92                        ] 92                        ] 
    93    non_words = [  93    non_words = [  
    94            'd', # accus'd 94            'd', # accus'd 
    95            't', 95            't', 
    96            ] 96            ] 
    97 97 
    98    def is_roman_numeral(self, word): 98    def is_roman_numeral(self, word): 
    99        digits = [ 'i', 'ii', 'iii', 'iv', 'v', 'vi', 'vii', 'viii', 'ix' ] 99        digits = [ 'i', 'ii', 'iii', 'iv', 'v', 'vi', 'vii', 'viii', 'ix' ] 
    100        others = [ 'l', 'x', 'c' ] 100        others = [ 'l', 'x', 'c' ] 
    101        if word == 'i': return False # exception because this conflicts with I 101        if word == 'i': return False # exception because this conflicts with I 
    102        while word[0] in others: 102        while word[0] in others: 
    103            if len(word) == 1: 103            if len(word) == 1: 
    104                return True 104                return True 
    105            else: 105            else: 
    106                word = word[1:] 106                word = word[1:] 
    107        return word in digits 107        return word in digits 
    108 108 
    109    def ignore_word(self, word): 109    def ignore_word(self, word): 
    110        "Return True if this word should not be added to the concordance." 110        "Return True if this word should not be added to the concordance." 
    111        bool1 = word in self.words_to_ignore 111        bool1 = word in self.words_to_ignore 
    112        bool2 = word in self.non_words 112        bool2 = word in self.non_words 
    113        # do roman numerals 113        # do roman numerals 
    114        bool3 = self.is_roman_numeral(word) 114        bool3 = self.is_roman_numeral(word) 
    115        return bool1 or bool2 or bool3 115        return bool1 or bool2 or bool3 
    116 116 
    117    def _text_already_done(self, text): 117    def _text_already_done(self, text): 
    118        numrecs = shakespeare.model.Concordance.select( 118        numrecs = shakespeare.model.Concordance.select( 
    119                shakespeare.model.Concordance.q.textID==text.id 119                shakespeare.model.Concordance.q.textID==text.id 
    120                ).count() 120                ).count() 
    121        return numrecs > 0 121        return numrecs > 0 
    122 122 
    123    def add_text(self, name, text=None): 123    def add_text(self, name, text=None): 
    124        """Add a text to the concordance. 124        """Add a text to the concordance. 
    125        @param name: name of text to add 125        @param name: name of text to add 
    126        @param text: [optional] a file-like object containing text data. If not 126        @param text: [optional] a file-like object containing text data. If not 
    127            provided will default to using file in cache associated with named 127            provided will default to using file in cache associated with named 
    128            text 128            text 
    129        """ 129        """ 
    130        dmText = shakespeare.model.Material.byName(name) 130        dmText = shakespeare.model.Material.byName(name) 
    131        if self._text_already_done(dmText): 131        if self._text_already_done(dmText): 
    132            msg = 'Have already added to concordance text: %s' % dmText 132            msg = 'Have already added to concordance text: %s' % dmText 
    133            # raise ValueError(msg) 133            # raise ValueError(msg) 
    134            print msg 134            print msg 
    135            print 'Skipping' 135            print 'Skipping' 
    136            return 136            return 
    137        if text is None: 137        if text is None: 
    138            tpath = dmText.get_cache_path('plain') 138            tpath = dmText.get_text() 
    139            text = file(tpath) 139            text = file(tpath) 
    140        lineCount = 0 140        lineCount = 0 
    141        charIndex = 0 141        charIndex = 0 
    142        stats = {} 142        stats = {} 
    143        trans = shakespeare.model.Concordance._connection.transaction() 143        trans = shakespeare.model.Concordance._connection.transaction() 
    144        for line in text.readlines(): 144        for line in text.readlines(): 
    145            for match in self.word_regex.finditer(line): 145            for match in self.word_regex.finditer(line): 
    146                word = match.group().lower() # case insensitive 146                word = match.group().lower() # case insensitive 
    147                if self.ignore_word(word): 147                if self.ignore_word(word): 
    148                    continue 148                    continue 
    149                shakespeare.model.Concordance(connection=trans, 149                shakespeare.model.Concordance(connection=trans, 
    150                                           text=dmText, 150                                           text=dmText, 
    151                                           word=word, 151                                           word=word, 
    152                                           line=lineCount, 152                                           line=lineCount, 
    153                                           char_index=charIndex+match.start()) 153                                           char_index=charIndex+match.start()) 
    154                stats[word] = stats.get(word, 0) + 1 154                stats[word] = stats.get(word, 0) + 1 
    155            lineCount += 1 155            lineCount += 1 
    156            charIndex += len(line) 156            charIndex += len(line) 
    157        trans.commit() 157        trans.commit() 
    158        trans = shakespeare.model.Concordance._connection.transaction() 158        trans = shakespeare.model.Concordance._connection.transaction() 
    159        for word, value in stats.items(): 159        for word, value in stats.items(): 
    160            tresults  = shakespeare.model.Statistic.select( 160            tresults  = shakespeare.model.Statistic.select( 
    161                sqlobject.AND( 161                sqlobject.AND( 
    162                    shakespeare.model.Statistic.q.textID == dmText.id, 162                    shakespeare.model.Statistic.q.textID == dmText.id, 
    163                    shakespeare.model.Statistic.q.word == word 163                    shakespeare.model.Statistic.q.word == word 
    164                    )) 164                    )) 
    165            try: 165            try: 
    166                dbstat = list(tresults)[0] 166                dbstat = list(tresults)[0] 
    167                dbstat.occurrences += value 167                dbstat.occurrences += value 
    168            except: 168            except: 
    169                shakespeare.model.Statistic( 169                shakespeare.model.Statistic( 
    170                        connection=trans, 170                        connection=trans, 
    171                        text=dmText, 171                        text=dmText, 
    172                        word=word, 172                        word=word, 
    173                        occurrences=value 173                        occurrences=value 
    174                        ) 174                        ) 
    175        trans.commit() 175        trans.commit() 
    176 176 
    177 177 
    178    def remove_text(self, name): 178    def remove_text(self, name): 
    179        """Remove a text from the concordance. 179        """Remove a text from the concordance. 
    180 180 
    181        @param name: as for add_text 181        @param name: as for add_text 
    182        """ 182        """ 
    183        dmText = shakespeare.model.Material.byName(name) 183        dmText = shakespeare.model.Material.byName(name) 
    184        recs = shakespeare.model.Concordance.select( 184        recs = shakespeare.model.Concordance.select( 
    185                shakespeare.model.Concordance.q.textID==dmText.id 185                shakespeare.model.Concordance.q.textID==dmText.id 
    186                ) 186                ) 
    187        for rec in recs: 187        for rec in recs: 
    188            shakespeare.model.Concordance.delete(rec.id) 188            shakespeare.model.Concordance.delete(rec.id) 
    189        stats = shakespeare.model.Statistic.select( 189        stats = shakespeare.model.Statistic.select( 
    190                shakespeare.model.Statistic.q.textID==dmText.id 190                shakespeare.model.Statistic.q.textID==dmText.id 
    191                ) 191                ) 
    192        for stat in stats: 192        for stat in stats: 
    193            shakespeare.model.Statistic.delete(stat.id) 193            shakespeare.model.Statistic.delete(stat.id) 
    194 194 
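
Note on add_text above: the indexing loop can be sketched without the database. This is a minimal, self-contained illustration in modern Python, not the project's code. It assumes the text arrives as a file-like object (as the changeset message suggests get_text provides). WORD_REGEX and index_text are hypothetical names; the real word_regex is not shown in this diff.

```python
import io
import re

# Hypothetical pattern; the project's actual word_regex is not shown in the diff.
WORD_REGEX = re.compile(r"[A-Za-z']+")


def index_text(fileobj, ignore=frozenset()):
    """Return (entries, stats) for a file-like object containing text.

    entries: list of (word, line_number, char_index) tuples, where
             char_index is the offset from the start of the whole text.
    stats:   dict mapping each indexed word to its occurrence count.
    """
    entries = []
    stats = {}
    char_index = 0  # offset of the current line's first character
    for line_number, line in enumerate(fileobj):
        for match in WORD_REGEX.finditer(line):
            word = match.group().lower()  # case insensitive, as in add_text
            if word in ignore:
                continue
            entries.append((word, line_number, char_index + match.start()))
            stats[word] = stats.get(word, 0) + 1
        char_index += len(line)
    return entries, stats


sample = io.StringIO("To be, or not to be\nthat is the question\n")
entries, stats = index_text(sample, ignore={"the"})
```

Here entries plays the role of the Concordance rows and stats the role of the Statistic rows; the ignore set stands in for ignore_word.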
  • trunk/shakespeare/controllers/site.py

    Revision 150 Revision 154
    1import logging 1import logging 
    2 2 
    3import genshi 3import genshi 
    4 4 
    5from shakespeare.lib.base import * 5from shakespeare.lib.base import * 
    6 6 
    7import shakespeare 7import shakespeare 
    8import shakespeare.index 8import shakespeare.index 
    9import shakespeare.format 9import shakespeare.format 
    10import shakespeare.concordance 10import shakespeare.concordance 
    11import shakespeare.model as model 11import shakespeare.model as model 
    12 12 
    13# import this after dm so that db connection is set 13# import this after dm so that db connection is set 
    14import annotater.store 14import annotater.store 
    15import annotater.marginalia 15import annotater.marginalia 
    16 16 
    17log = logging.getLogger(__name__) 17log = logging.getLogger(__name__) 
    18 18 
    19 19 
    20class SiteController(BaseController): 20class SiteController(BaseController): 
    21 21 
    22    def index(self): 22    def index(self): 
    23        c.works_index = shakespeare.index.all 23        c.works_index = shakespeare.index.all 
    24        return render('index') 24        return render('index') 
    25 25 
    26    def guide(self): 26    def guide(self): 
    27        return render('guide') 27        return render('guide') 
    28 28 
    29    def view(self): 29    def view(self): 
    30        name = request.params.get('name', '') 30        name = request.params.get('name', '') 
    31        format = request.params.get('format', 'plain') 31        format = request.params.get('format', 'plain') 
    32        if format == 'annotate': 32        if format == 'annotate': 
    33            return self.view_annotate(name) 33            return self.view_annotate(name) 
    34        namelist = name.split() 34        namelist = name.split() 
    35        numtexts = len(namelist) 35        numtexts = len(namelist) 
    36        textlist = [model.Material.byName(tname) for tname in namelist] 36        textlist = [model.Material.byName(tname) for tname in namelist] 
    37        # special case (only return the first text) 37        # special case (only return the first text) 
    38        if format == 'raw': 38        if format == 'raw': 
    39            tpath = textlist[0].get_cache_path('plain') 39            result = textlist[0].get_text().read() 
    40            result = file(tpath).read()   
    41            status = '200 OK' 40            status = '200 OK' 
    42            response.headers['Content-Type'] = 'text/plain' 41            response.headers['Content-Type'] = 'text/plain' 
    43            return result 42            return result 
    44        texts = [] 43        texts = [] 
    45        for item in textlist: 44        for item in textlist: 
    46            tpath = item.get_cache_path('plain') 45            tfileobj = item.get_text() 
    47            tfileobj = file(tpath)   
    48