CCP4i2 Developer Notes - Database

Introduction

Useful docs: the W3Schools SQL tutorial, the SQLite documentation on the SQL it understands, and Wikipedia for TLAs etc.

i2 uses sqlite, accessed via the Python sqlite3 interface (other options are discussed later). All access should go through the database api, dbapi/CCP4DbApi.py; the schema is dbapi/database_schema.sql. The db is usually opened during startup by the utils/startup.py:startDb() function and can be accessed via CCP4Modules.PROJECTSMANAGER().db(). By default the user database file is $HOME/.CCP4I2/db/database.sqlite but this can be changed either by a command line argument (-db) or by the environment variables:
CCP4_LOCAL_HOME which resets where .CCP4I2 and other project directories are by default
CCP4_LOCAL_DOTDIR resets .CCP4I2 parent path
These are mostly useful for classroom environments with disk space and/or network speed issues.
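A minimal sketch of how the default path and the CCP4_LOCAL_DOTDIR override might be resolved. The helper name and the exact precedence are assumptions for illustration; the real logic (which also handles CCP4_LOCAL_HOME and the -db argument) lives in utils/startup.py:startDb().

```python
import os

def default_db_path():
    """Resolve a default i2 database path, honouring the hypothetical
    CCP4_LOCAL_DOTDIR override of the .CCP4I2 parent directory."""
    dotdir_parent = os.environ.get("CCP4_LOCAL_DOTDIR",
                                   os.path.expanduser("~"))
    return os.path.join(dotdir_parent, ".CCP4I2", "db", "database.sqlite")
```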

Note that the Python sqlite3 module requires (by default) that a connection is used only in the thread that created it. The database is accessed by both the main gui process and the 'background' runTask processes. I have tried to minimise all access, but especially access from runTask processes.
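The thread restriction is easy to demonstrate with the sqlite3 module on its own:

```python
import sqlite3
import threading

# By default (check_same_thread=True) a sqlite3 connection may only be
# used from the thread that created it; use elsewhere raises
# sqlite3.ProgrammingError.
conn = sqlite3.connect(":memory:")

errors = []

def use_from_other_thread():
    try:
        conn.execute("SELECT 1")
    except sqlite3.ProgrammingError as exc:
        errors.append(exc)

t = threading.Thread(target=use_from_other_thread)
t.start()
t.join()
# errors now holds the ProgrammingError raised in the worker thread
```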

The present implementation expects single-user access to the db: the user's login name is found automatically and checked against the value saved in the Users table (set when the database is created), and access may fail if they do not match. The gui (in the startup.py:startDb() function) deals with this by asking the user to provide the 'old' username as found in the db. There are tables in the db, such as ProjectsUsersPermissions, which are intended for more sophisticated multi-user access but are not fully implemented.

Each of the main tables in the database has an id: a randomly generated unique identifier that serves as the primary key. In Python this is handled as a string (or rather as the type defined by the CCP4DbApi.UUIDTYPE variable, which is currently set to str) and is widely used. There are also some tables (FileTypes, JobStatus) which contain lists of integer 'enums' that are used in the 'main' tables.
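As a sketch of how such string ids can be generated (uuid4 and the hex format here are assumptions; the actual generator and format are whatever CCP4DbApi uses):

```python
import uuid

UUIDTYPE = str  # mirrors the CCP4DbApi.UUIDTYPE setting described above

def newUuid():
    """Generate a random unique identifier, handled in Python as a
    string. Illustrative only: the exact id format i2 stores is
    defined in CCP4DbApi."""
    return UUIDTYPE(uuid.uuid4().hex)

jobId = newUuid()
```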

To inspect the db: in the Job list, the context (right mouse) menus for jobs and files have a View->Database entry. The program sqlitebrowser is also very useful, and can additionally be used to edit a db.

Overview of CDbApi methods

All access to the db is channelled through the execute() and commit() methods, so changes needed when porting to an alternative sql implementation are hopefully localised to these methods. Most methods in CDbApi create an sql command and use execute() (always!) to run it. execute() traps and reports errors and, if setDiagnostic() has been set True, reports all executed sql commands. If the sql command has changed the database then it must be followed by a commit(). If the change is not committed then the process does not release the db; typically, for the case of the gui holding the db, this prevents the runTask from reporting a job finished, so the job appears to run forever. If the sql command is a 'getter' returning data then the fetchAll2Py() or fetchAll2PyList() method should be called to return the data as lists of appropriate Python data types. The difference between these functions is that fetchAll2Py() returns a single flat list from a single column of data, while fetchAll2PyList() returns a list of lists covering multiple columns. Both functions require as input the Python type, or list of Python types, for the returned data.
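A toy sketch of that execute()/commit()/fetchAll2Py() division of labour. The class and method names mirror CDbApi for readability, but the code is illustrative, not the real implementation:

```python
import sqlite3

class MiniDbApi:
    """Illustrative stand-in for the CDbApi access pattern."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.cur = self.conn.cursor()
        self.diagnostic = False  # cf. setDiagnostic()

    def execute(self, sql, args=()):
        # Single funnel for all sql; report commands in diagnostic mode
        if self.diagnostic:
            print("SQL:", sql, args)
        self.cur.execute(sql, args)

    def commit(self):
        # Must follow any change, or the db is not released
        self.conn.commit()

    def fetchAll2Py(self, pytype):
        # Single column -> flat list of pytype
        return [pytype(row[0]) for row in self.cur.fetchall()]

    def fetchAll2PyList(self, pytypes):
        # Multiple columns -> list of lists, converted per column
        return [[t(v) for t, v in zip(pytypes, row)]
                for row in self.cur.fetchall()]

db = MiniDbApi()
db.execute("CREATE TABLE Jobs (JobId TEXT, JobNumber INTEGER)")
db.execute("INSERT INTO Jobs VALUES (?,?)", ("abc", 1))
db.commit()
db.execute("SELECT JobNumber FROM Jobs")
print(db.fetchAll2Py(int))  # [1]
```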

For the main tables in the database there are methods:
createWhatever() - create a new row in the table
updateWhatever(whateverId=None,key=None,value=None) - update the key column in a row to value.
getWhateverInfo(whateverId,mode=None) - return a dictionary of data for the whatever with id whateverId, covering all the columns listed in the mode argument. If mode is the default, None, then all columns are returned. If mode is a single column name then a single Python value is returned rather than a dict. The keys in the returned dict are the column names, but all lower case.
deleteWhatever(whateverId) - delete the row with id whateverId
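The mode conventions can be illustrated with a hypothetical stand-in for getWhateverInfo(), working on a fake in-memory row rather than the db (the column names and values are invented for the example):

```python
_ROW = {"status": 6, "taskname": "refmac"}  # fake db row for illustration

def getJobInfoSketch(jobId, mode=None):
    """Sketch of the return conventions described above:
    mode=None        -> dict of all columns (lower-case keys)
    mode='ColName'   -> bare Python value
    mode=[names...]  -> dict restricted to those columns"""
    if mode is None:
        return dict(_ROW)
    if isinstance(mode, str):
        return _ROW[mode.lower()]
    return {name.lower(): _ROW[name.lower()] for name in mode}
```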

Note that, for convenience, the much used getJobInfo() can return data not in the Jobs table: projectname, runtime (finish-start time), parentjobnumber, childjobs, descendentjobs, performanceclass, performance.

There are numerous methods to return more complex data combinations required in specific gui contexts, to load data from finished jobs, to handle running jobs (?LINK), and to support import/export of projects (?LINK).

The following review of db tables mentions several redundant and 'not yet fully implemented' tables and columns. There is a difficulty: sqlite cannot simply drop columns - this can only be done by copying the data into a replacement table without the no-longer-required features.
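For reference, the standard sqlite copy-and-rename workaround for dropping a column looks like this (recent sqlite versions, 3.35+, do support ALTER TABLE ... DROP COLUMN, but this is the portable route; table and column names here are just an example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Files (FileID TEXT, Filename TEXT, FilePath TEXT);
    INSERT INTO Files VALUES ('f1', 'model.pdb', '/old/path');

    -- Recreate the table without the unwanted FilePath column,
    -- copy the surviving data across, then swap the tables.
    CREATE TABLE Files_new (FileID TEXT, Filename TEXT);
    INSERT INTO Files_new SELECT FileID, Filename FROM Files;
    DROP TABLE Files;
    ALTER TABLE Files_new RENAME TO Files;
""")
cols = [row[1] for row in conn.execute("PRAGMA table_info(Files)")]
print(cols)  # ['FileID', 'Filename']
```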

Projects and related tables

The Projects table holds:

ProjectsUsersPermissions is a table to hold project access details for users other than the owner - not currently used.

ProjectExports and ProjectImports tables track whenever a project is exported or imported. Currently this info is used in the gui to offer the user the option to export only jobs created since the last import/export. ProjectImports.ProjectExportDatabaseId references the databaseId of the source of the imported data. These tables hold potentially useful 'where did that job come from' info, but it is not currently accessible from the gui.

Recently (Oct 2016) tables to support tags and comments on projects have been added. ?LINK User docs. The Tags table is a simple list of tags, but Tags.ParentTagID enables grouping of tags into a hierarchy. The ProjectTags table associates a project with a tag. The ProjectComments table saves comments with a userid and time, and could potentially be used for a 'conversation' on the project.
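Assuming the Tags columns described above (TagID and ParentTagID, plus a hypothetical Text column for the tag label), a recursive CTE is one way to collect a whole tag subtree in a single query:

```python
import sqlite3

# Simplified Tags table: a root tag with two children
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Tags (TagID TEXT PRIMARY KEY, ParentTagID TEXT, Text TEXT);
    INSERT INTO Tags VALUES ('t1', NULL, 'lysozyme');
    INSERT INTO Tags VALUES ('t2', 't1', 'native');
    INSERT INTO Tags VALUES ('t3', 't1', 'derivative');
""")

# Walk the ParentTagID hierarchy from the root tag downwards
rows = conn.execute("""
    WITH RECURSIVE subtags(TagID, Text) AS (
        SELECT TagID, Text FROM Tags WHERE TagID = 't1'
        UNION ALL
        SELECT t.TagID, t.Text FROM Tags t
          JOIN subtags s ON t.ParentTagID = s.TagID
    )
    SELECT Text FROM subtags ORDER BY Text
""").fetchall()
print([r[0] for r in rows])  # ['derivative', 'lysozyme', 'native']
```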

Jobs and related tables

The Comments table is a list of comments on specific jobs.

JobKeyValues and JobKeyCharValues hold real or character performance-indicator data that summarise the quality of the job result. See dev docs. The KeyTypes table lists the allowed JobKeyValues.KeyTypeID values.

Files and related tables

A record of a file is created in the Files table when the file is either imported into i2 or output by a job. The Files record includes the jobId of the job importing/creating the file. If the file was imported then it is also recorded in the ImportFiles table (whose columns include the name of the original source file).

FileUses The file may subsequently be input to another job, and the FileUses table associates the file with that other job (with a Role value indicating 'in'). A further complication is that a file may be output twice (sort-of): first by a sub-job in a pipeline, and then as one of the final files output by the whole pipeline. The file needs to be associated with two jobs, so a FileUses entry (with a Role column value indicating 'out') is used for the second job.

Filename The Files.Filename column contains just the name of the file (no path). A file is always expected to be either in the job directory of the job that created or imported it, or in the CCP4_IMPORTED_FILES directory for the project. The Files.PathFlag column indicates which of these it is. There is a redundant Files.FilePath column (which cannot easily be removed from an sqlite db).
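A sketch of the path resolution this implies. The PathFlag values (1 = job directory, 2 = CCP4_IMPORTED_FILES) and the CCP4_JOBS/job_N directory layout are assumptions for illustration; check CCP4DbApi for the real values:

```python
import os

def resolveFilePath(projectDir, jobNumber, filename, pathFlag):
    """Rebuild a full path from Files.Filename and Files.PathFlag.
    Hypothetical flag values: 1 -> job directory, 2 -> imported files."""
    if pathFlag == 2:
        return os.path.join(projectDir, "CCP4_IMPORTED_FILES", filename)
    return os.path.join(projectDir, "CCP4_JOBS",
                        "job_%s" % jobNumber, filename)
```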

JobParamName In both the Files and FileUses tables there is the jobId of a job and also a JobParamName column, which contains the name of the parameter from the task (e.g. XYZIN).

Annotation Files.Annotation is a short text string to appear in the gui.

File types There are three Files columns containing flags to denote some aspect of file type:
Files.FiletypeID is the fundamental type, which corresponds to a single data class derived from CDataFile; the possible values are listed in CCP4DbApi.FILETYPELIST.
Files.FileContent Some types of file (notably the 'mini' MTZs) can hold data in different forms, and this column flags that form - the permissible values depend on Files.FiletypeID.
Files.FileSubType For some types of file (eg CPdbDataFile) the content of the file may only be appropriate for use in some of the contexts that use that type of file. This column is an attempt at codifying that problem - it is little used in practice.
Note -- The names FileContent and FileSubType would probably be more appropriate if reversed!

ImportFiles This table exists mostly to hold the SourceFilename of an imported file. It also contains a checksum of the source file at the time of import. Any subsequent access to the source file should check that it still has the same checksum, i.e. that it is still the same file.
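A sketch of the checksum comparison, assuming md5 for illustration (the actual algorithm is whatever CCP4DbApi used when the import record was written):

```python
import hashlib

def fileChecksum(path, blocksize=1 << 16):
    """Checksum a source file in blocks, for comparison against the
    value stored in ImportFiles at import time. md5 is an assumption
    here, not necessarily what i2 uses."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            h.update(block)
    return h.hexdigest()
```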

ExportFiles Ideally, when a user exports a file from i2 (presumably to use in some other software) the export is noted in this table. When another file is subsequently imported (presumably the output from the other software) it should be associated with the exported file; the ImportFiles.ExportFileID column makes this connection. This is not fully implemented and used, partly due to user resistance to being organised.

FileAssociations, with the support tables FileAssociationTypes, FileAssociationRoles and FileAssociationMembers, is intended to make an association between two files, such as the reflection data and freer data. This is fully implemented in CDbApi but not used yet. The CDbApi.createFileAssociation() method creates an entry in the FileAssociations table and then, for every file in the association, adds an entry to FileAssociationMembers with a reference to the FileAssociations entry.
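The shape of that operation - one FileAssociations row plus one FileAssociationMembers row per file - can be sketched against a deliberately simplified schema (the real tables have more columns, and the function signature here is invented):

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE FileAssociations (
        FileAssociationID TEXT, FileAssociationTypeID INTEGER);
    CREATE TABLE FileAssociationMembers (
        FileAssociationID TEXT, FileID TEXT, FileAssociationRoleID INTEGER);
""")

def createFileAssociationSketch(typeId, fileRoles):
    """Illustrative version of createFileAssociation(): one parent row,
    one member row per (fileId, roleId) pair."""
    assocId = uuid.uuid4().hex
    conn.execute("INSERT INTO FileAssociations VALUES (?,?)",
                 (assocId, typeId))
    for fileId, roleId in fileRoles:
        conn.execute("INSERT INTO FileAssociationMembers VALUES (?,?,?)",
                     (assocId, fileId, roleId))
    conn.commit()
    return assocId

# e.g. associate a reflection file with its freer set
assocId = createFileAssociationSketch(1, [("obs-file", 1), ("free-file", 2)])
```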

Loading file data. The params.def.xml file associated with a job contains all parameters for the job, including the input and output data files. The CDbApi.gleanJobFiles() method extracts data from the params.def.xml and creates the Files and FileUses records. gleanJobFiles() is passed a CContainer instance containing the contents of the param file. The method scans the inputData and outputData sub-containers for objects of types derived from CDataFile, and checks that the specified file name exists before creating a Files or FileUses record.

Databases and other assorted tables

Updating the schema

CDbUtils.COpenJob

This class is used as a cache for info on one job, to minimise db access. Its contents are updated if the db is updated, and it emits appropriate change signals. The class is mostly used as a member of the CProjectViewer class.

COpenJob also contains methods to create and run jobs, so it has some of the functionality of CProjectViewer but accessible programmatically and non-graphically. COpenJob is currently used in the project-based testing (CCP4ProjectBasedTesting.py). It could be useful in an automated system such as the demo_i2_scripts.

Exporting and Importing Projects

i2 can export a whole project, or some selected jobs from one project, by creating a compressed (zip) file containing all or part of the project directory and an xml representation of the appropriate part of the database. When the compressed file is imported, i2 creates a new project or adds to an existing project as appropriate. The import mechanism is careful not to import jobs or files that already exist. The import/export mechanism is heavily dependent on the UUIDs (Wikipedia) Projects.projectId, Jobs.JobId and Files.FileId to check what is already in the database.
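The UUID-based de-duplication idea reduces to simple set membership: collect the ids already in the target db and skip any imported row whose id matches. A minimal sketch (illustrative only; the real logic is in the import machinery):

```python
def jobsToImport(importedJobIds, existingJobIds):
    """Return the subset of imported job ids not already present in
    the target database, preserving the import order."""
    existing = set(existingJobIds)
    return [jid for jid in importedJobIds if jid not in existing]

print(jobsToImport(["j1", "j2", "j3"], ["j2"]))  # ['j1', 'j3']
```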

The interface to import/export is in CCP4ProjectsManagerGui.py, and CCP4ProjectsManager.compressProject() is used to export selected jobs. An outline of the export process:

The import process is a tricky piece of code and should be changed with extreme caution. Note that CCP4Export.ImportProjectThread is not actually a thread, as the db could not be accessed from a separate thread. The basic process:

Schema syncing issues. There is potential for problems if the importing and exporting i2 installations have different database schemas. I believe the import code can handle encountering tables or columns that it does not recognise, or missing tables and columns, but there may be issues with missing content.

Future database development issues

There has been discussion on enabling an alternative RDBMS to sqlite - the requirements for any alternative are that it is free and, ideally, can be distributed by ccp4 or is easy for a user site to install. There is a comprehensive list of RDBMSs on Wikipedia. The advantage would be for a user site to have a central database for everyone's projects. One way to use this would be for the local sqlite db to remain the immediate db, but to update a central db regularly.

Some notes: