CCP4i2 Developer Notes - Database

Introduction

Useful docs: the W3Schools SQL tutorial, the SQLite documentation on the SQL it understands, and Wikipedia for TLAs etc.

i2 uses sqlite, accessed via the Python sqlite3 interface (other options are discussed later). All access should go through the database api, dbapi/CCP4DbApi.py; the schema is dbapi/database_schema.sql. The db is usually opened during startup by the utils/startup.py:startDb() function and can be accessed via CCP4Modules.PROJECTSMANAGER().db(). By default the user database file is $HOME/.CCP4I2/db/database.sqlite but this can be changed either by a command line argument (-db) or by the environment variables:
CCP4_LOCAL_HOME which resets where .CCP4I2 and other project directories are by default
CCP4_LOCAL_DOTDIR resets .CCP4I2 parent path
These are mostly useful for classroom environments with disk space and/or network speed issues.
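A minimal sketch of how the default path and the CCP4_LOCAL_DOTDIR override might be resolved. The helper name and the exact precedence are assumptions for illustration; the real logic (which also handles CCP4_LOCAL_HOME and the -db argument) lives in utils/startup.py:startDb().

```python
import os

def default_db_path():
    """Resolve a default i2 database path, honouring the hypothetical
    CCP4_LOCAL_DOTDIR override of the .CCP4I2 parent directory."""
    dotdir_parent = os.environ.get("CCP4_LOCAL_DOTDIR",
                                   os.path.expanduser("~"))
    return os.path.join(dotdir_parent, ".CCP4I2", "db", "database.sqlite")
```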

Note that the Python sqlite3 module requires (by default) that a connection is used only in the thread that created it. The database is accessed by both the main gui process and the 'background' runTask processes. I have tried to minimise all access, but especially access from runTask processes.
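The thread restriction is easy to demonstrate with the sqlite3 module on its own:

```python
import sqlite3
import threading

# By default (check_same_thread=True) a sqlite3 connection may only be
# used from the thread that created it; use elsewhere raises
# sqlite3.ProgrammingError.
conn = sqlite3.connect(":memory:")

errors = []

def use_from_other_thread():
    try:
        conn.execute("SELECT 1")
    except sqlite3.ProgrammingError as exc:
        errors.append(exc)

t = threading.Thread(target=use_from_other_thread)
t.start()
t.join()
# errors now holds the ProgrammingError raised in the worker thread
```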

The present implementation expects single-user access to the db: the user's login name is found automatically and checked against the value saved in the Users table (set when the database is created), and access may fail if they do not match. The gui (in the startup.py:startDb() function) deals with this by asking the user to provide the 'old' username as found in the db. There are tables in the db, such as ProjectsUsersPermissions, which are intended for more sophisticated multi-user access but are not fully implemented.

Each of the main tables in the database has an id: a randomly generated unique identifier that serves as the primary key. In Python this is handled as a string (or rather as the type defined by the CCP4DbApi.UUIDTYPE variable, which is currently set to str) and is widely used. There are also some tables (FileTypes, JobStatus) which contain lists of integer 'enums' that are used in the 'main' tables.
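As a sketch of how such string ids can be generated (uuid4 and the hex format here are assumptions; the actual generator and format are whatever CCP4DbApi uses):

```python
import uuid

UUIDTYPE = str  # mirrors the CCP4DbApi.UUIDTYPE setting described above

def newUuid():
    """Generate a random unique identifier, handled in Python as a
    string. Illustrative only: the exact id format i2 stores is
    defined in CCP4DbApi."""
    return UUIDTYPE(uuid.uuid4().hex)

jobId = newUuid()
```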

To inspect the db: in the Job list, the context (right mouse) menus for jobs and files have a View->Database entry. The program sqlitebrowser is also very useful, and can additionally be used to edit a db.

Overview of CDbApi methods

All access to the db is channelled through the execute() and commit() methods, so changes needed when porting to an alternative sql implementation are hopefully localised to these methods. Most methods in CDbApi create an sql command and use execute() (always!) to run it. execute() traps and reports errors and, if setDiagnostic() has been set True, reports all executed sql commands. If the sql command has changed the database then it must be followed by a commit(). If the change is not committed then the process does not release the db; typically, for the case of the gui holding the db, this prevents the runTask from reporting a job finished, so the job appears to run forever. If the sql command is a 'getter' returning data then the fetchAll2Py() or fetchAll2PyList() method should be called to return the data as lists of appropriate Python data types. The difference between these functions is that fetchAll2Py() returns a single flat list from a single column of data, while fetchAll2PyList() returns a list of lists covering multiple columns. Both functions require as input the Python type, or list of Python types, for the returned data.
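A toy sketch of that execute()/commit()/fetchAll2Py() division of labour. The class and method names mirror CDbApi for readability, but the code is illustrative, not the real implementation:

```python
import sqlite3

class MiniDbApi:
    """Illustrative stand-in for the CDbApi access pattern."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.cur = self.conn.cursor()
        self.diagnostic = False  # cf. setDiagnostic()

    def execute(self, sql, args=()):
        # Single funnel for all sql; report commands in diagnostic mode
        if self.diagnostic:
            print("SQL:", sql, args)
        self.cur.execute(sql, args)

    def commit(self):
        # Must follow any change, or the db is not released
        self.conn.commit()

    def fetchAll2Py(self, pytype):
        # Single column -> flat list of pytype
        return [pytype(row[0]) for row in self.cur.fetchall()]

    def fetchAll2PyList(self, pytypes):
        # Multiple columns -> list of lists, converted per column
        return [[t(v) for t, v in zip(pytypes, row)]
                for row in self.cur.fetchall()]

db = MiniDbApi()
db.execute("CREATE TABLE Jobs (JobId TEXT, JobNumber INTEGER)")
db.execute("INSERT INTO Jobs VALUES (?,?)", ("abc", 1))
db.commit()
db.execute("SELECT JobNumber FROM Jobs")
print(db.fetchAll2Py(int))  # [1]
```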

For the main tables in the database there are methods:
createWhatever() - create a new row in the table
updateWhatever(whateverId=None,key=None,value=None) - update the key column in a row to value.
getWhateverInfo(whateverId,mode=None) - return a dictionary of data for the whatever with id whateverId, covering all the columns listed in the mode argument. If mode is the default, None, then all columns are returned. If mode is a single column name then a single Python value is returned rather than a dict. The keys in the returned dict are the column names, but all lower case.
deleteWhatever(whateverId) - delete the row with id whateverId
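The mode conventions can be illustrated with a hypothetical stand-in for getWhateverInfo(), working on a fake in-memory row rather than the db (the column names and values are invented for the example):

```python
_ROW = {"status": 6, "taskname": "refmac"}  # fake db row for illustration

def getJobInfoSketch(jobId, mode=None):
    """Sketch of the return conventions described above:
    mode=None        -> dict of all columns (lower-case keys)
    mode='ColName'   -> bare Python value
    mode=[names...]  -> dict restricted to those columns"""
    if mode is None:
        return dict(_ROW)
    if isinstance(mode, str):
        return _ROW[mode.lower()]
    return {name.lower(): _ROW[name.lower()] for name in mode}
```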

Note that, for convenience, the much used getJobInfo() can return data not in the Jobs table: projectname, runtime (finish-start time), parentjobnumber, childjobs, descendentjobs, performanceclass, performance.

There are numerous methods to return more complex data combinations required in specific gui contexts, to load data from finished jobs, to handle running jobs (?LINK), and to support import/export of projects (?LINK).

The following review of db tables mentions several redundant and 'not yet fully implemented' tables and columns. There is a difficulty: sqlite cannot simply drop columns - this can only be done by copying the data into a replacement table without the no-longer-required features.
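For reference, the standard sqlite copy-and-rename workaround for dropping a column looks like this (recent sqlite versions, 3.35+, do support ALTER TABLE ... DROP COLUMN, but this is the portable route; table and column names here are just an example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Files (FileID TEXT, Filename TEXT, FilePath TEXT);
    INSERT INTO Files VALUES ('f1', 'model.pdb', '/old/path');

    -- Recreate the table without the unwanted FilePath column,
    -- copy the surviving data across, then swap the tables.
    CREATE TABLE Files_new (FileID TEXT, Filename TEXT);
    INSERT INTO Files_new SELECT FileID, Filename FROM Files;
    DROP TABLE Files;
    ALTER TABLE Files_new RENAME TO Files;
""")
cols = [row[1] for row in conn.execute("PRAGMA table_info(Files)")]
print(cols)  # ['FileID', 'Filename']
```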

Projects and related tables

The Projects table holds:

ProjectsUsersPermissions is a table to hold project access details for users other than the owner - not currently used.

ProjectExports and ProjectImports tables track whenever a project is exported or imported. Currently this info is used in the gui to offer the user the option to export only jobs created since the last import/export. ProjectImports.ProjectExportDatabaseId references the databaseId of the source of the imported data. These tables hold potentially useful 'where did that job come from' info, but it is not currently accessible from the gui.

Recently (Oct 2016) tables to support tags and comments on projects have been added. ?LINK User docs. The Tags table is a simple list of tags, but Tags.ParentTagID enables grouping of tags into a hierarchy. The ProjectTags table associates a project with a tag. The ProjectComments table saves comments with a userid and time, and could potentially be used for a 'conversation' on the project.
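Assuming the Tags columns described above (TagID and ParentTagID, plus a hypothetical Text column for the tag label), a recursive CTE is one way to collect a whole tag subtree in a single query:

```python
import sqlite3

# Simplified Tags table: a root tag with two children
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Tags (TagID TEXT PRIMARY KEY, ParentTagID TEXT, Text TEXT);
    INSERT INTO Tags VALUES ('t1', NULL, 'lysozyme');
    INSERT INTO Tags VALUES ('t2', 't1', 'native');
    INSERT INTO Tags VALUES ('t3', 't1', 'derivative');
""")

# Walk the ParentTagID hierarchy from the root tag downwards
rows = conn.execute("""
    WITH RECURSIVE subtags(TagID, Text) AS (
        SELECT TagID, Text FROM Tags WHERE TagID = 't1'
        UNION ALL
        SELECT t.TagID, t.Text FROM Tags t
          JOIN subtags s ON t.ParentTagID = s.TagID
    )
    SELECT Text FROM subtags ORDER BY Text
""").fetchall()
print([r[0] for r in rows])  # ['derivative', 'lysozyme', 'native']
```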

Jobs and related tables

The Comments table is a list of comments on specific jobs.

JobKeyValues and JobKeyCharValues hold real or character performance-indicator data that summarise the quality of the job result. See dev docs. The KeyTypes table lists the allowed JobKeyValues.KeyTypeID values.

Files and related tables

A record of a file is created in the Files table when the file is either imported into i2 or output by a job. The Files record includes the jobId of the job importing/creating the file. If the file was imported then it is also recorded in the ImportFiles table (whose columns include the name of the original source file).

FileUses The file may subsequently be input to another job, and the FileUses table associates the file with that other job (with a Role value indicating 'in'). A further complication is that a file may be output twice (sort-of): first by a sub-job in a pipeline, and then as one of the final files output by the whole pipeline. The file needs to be associated with two jobs, so a FileUses entry (with a Role column value indicating 'out') is used for the second job.

Filename The Files.Filename column contains just the name of the file (no path). A file is always expected to be either in the job directory of the job that created or imported it, or in the CCP4_IMPORTED_FILES directory for the project. The Files.PathFlag column indicates which of these it is. There is a redundant Files.FilePath column (which cannot easily be removed from an sqlite db).
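A sketch of the path resolution this implies. The PathFlag values (1 = job directory, 2 = CCP4_IMPORTED_FILES) and the CCP4_JOBS/job_N directory layout are assumptions for illustration; check CCP4DbApi for the real values:

```python
import os

def resolveFilePath(projectDir, jobNumber, filename, pathFlag):
    """Rebuild a full path from Files.Filename and Files.PathFlag.
    Hypothetical flag values: 1 -> job directory, 2 -> imported files."""
    if pathFlag == 2:
        return os.path.join(projectDir, "CCP4_IMPORTED_FILES", filename)
    return os.path.join(projectDir, "CCP4_JOBS",
                        "job_%s" % jobNumber, filename)
```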

JobParamName In both the Files and FileUses tables there is the jobId of a job and also a JobParamName column, which contains the name of the parameter from the task (e.g. XYZIN).

Annotation Files.Annotation is a short text string to appear in the gui.

File types There are three Files columns containing flags to denote some aspect of file type:
Files.FiletypeID is the fundamental type, which corresponds to a single data class derived from CDataFile; the possible values are listed in CCP4DbApi.FILETYPELIST.
Files.FileContent Some types of file (notably the 'mini' MTZs) can hold data in different forms, and this column flags that form - the permissible values depend on Files.FiletypeID.
Files.FileSubType For some types of file (eg CPdbDataFile) the content of the file may only be appropriate for use in some of the contexts that use that type of file. This column is an attempt at codifying that problem - it is little used in practice.
Note -- The names FileContent and FileSubType would probably be more appropriate if reversed!

ImportFiles This table exists mostly to hold the SourceFilename of an imported file. It also contains a checksum of the source file at the time of import. Any subsequent access to the source file should check that it still has the same checksum, i.e. that it is still the same file.
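A sketch of the checksum comparison, assuming md5 for illustration (the actual algorithm is whatever CCP4DbApi used when the import record was written):

```python
import hashlib

def fileChecksum(path, blocksize=1 << 16):
    """Checksum a source file in blocks, for comparison against the
    value stored in ImportFiles at import time. md5 is an assumption
    here, not necessarily what i2 uses."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            h.update(block)
    return h.hexdigest()
```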

ExportFiles Ideally, when a user exports a file from i2 (presumably to use in some other software) the export is noted in this table. When another file is subsequently imported (presumably the output from the other software) it should be associated with the exported file; the ImportFiles.ExportFileID column makes this connection. This is not fully implemented and used, partly due to user resistance to being organised.

FileAssociations, with the support tables FileAssociationTypes, FileAssociationRoles and FileAssociationMembers, is intended to make an association between two files, such as the reflection data and freer data. This is fully implemented in CDbApi but not used yet. The CDbApi.createFileAssociation() method creates an entry in the FileAssociations table and then, for every file in the association, adds an entry to FileAssociationMembers with a reference to the FileAssociations entry.
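The shape of that operation - one FileAssociations row plus one FileAssociationMembers row per file - can be sketched against a deliberately simplified schema (the real tables have more columns, and the function signature here is invented):

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE FileAssociations (
        FileAssociationID TEXT, FileAssociationTypeID INTEGER);
    CREATE TABLE FileAssociationMembers (
        FileAssociationID TEXT, FileID TEXT, FileAssociationRoleID INTEGER);
""")

def createFileAssociationSketch(typeId, fileRoles):
    """Illustrative version of createFileAssociation(): one parent row,
    one member row per (fileId, roleId) pair."""
    assocId = uuid.uuid4().hex
    conn.execute("INSERT INTO FileAssociations VALUES (?,?)",
                 (assocId, typeId))
    for fileId, roleId in fileRoles:
        conn.execute("INSERT INTO FileAssociationMembers VALUES (?,?,?)",
                     (assocId, fileId, roleId))
    conn.commit()
    return assocId

# e.g. associate a reflection file with its freer set
assocId = createFileAssociationSketch(1, [("obs-file", 1), ("free-file", 2)])
```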

Loading file data. The params.def.xml file associated with a job contains all parameters for the job, including the input and output data files. The CDbApi.gleanJobFiles() method extracts data from the params.def.xml and creates the Files and FileUses records. gleanJobFiles() is passed a CContainer instance containing the contents of the param file. The method scans the inputData and outputData sub-containers for objects of types derived from CDataFile, and checks that the specified file name exists before creating a Files or FileUses record.

Databases and other assorted tables

Updating the schema

CDbUtils.COpenJob

This class is used as a cache for info on one job, to minimise db access. Its contents are updated if the db is updated, and it emits appropriate change signals. The class is mostly used as a member of the CProjectViewer class.

COpenJob also contains methods to create and run jobs, so it has some of the functionality of CProjectViewer but accessible programmatically and non-graphically. COpenJob is currently used in the project-based testing (CCP4ProjectBasedTesting.py). It could be useful in an automated system such as the demo_i2_scripts.

Exporting and Importing Projects

i2 can export a whole project, or some selected jobs from one project, by creating a compressed (zip) file containing all or part of the project directory and an xml representation of the appropriate part of the database. When the compressed file is imported, i2 creates a new project or adds to an existing project as appropriate. The import mechanism is careful not to import jobs or files that already exist. The import/export mechanism is heavily dependent on the UUIDs (Wikipedia) Projects.projectId, Jobs.JobId and Files.FileId to check what is already in the database.
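The UUID-based de-duplication idea reduces to simple set membership: collect the ids already in the target db and skip any imported row whose id matches. A minimal sketch (illustrative only; the real logic is in the import machinery):

```python
def jobsToImport(importedJobIds, existingJobIds):
    """Return the subset of imported job ids not already present in
    the target database, preserving the import order."""
    existing = set(existingJobIds)
    return [jid for jid in importedJobIds if jid not in existing]

print(jobsToImport(["j1", "j2", "j3"], ["j2"]))  # ['j1', 'j3']
```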

The interface to import/export is in CCP4ProjectsManagerGui.py, and CCP4ProjectsManager.compressProject() is used to export selected jobs. An outline of the export process:

The import process is a tricky piece of code and should be changed with extreme caution. Note that CCP4Export.ImportProjectThread is not actually a thread, as the db could not be accessed from a separate thread. The basic process:

Schema syncing issues. There is potential for problems if the importing and exporting i2 installations have different database schemas. I believe the import code can handle encountering tables or columns that it does not recognise, or missing tables and columns, but there may be issues with missing content.

Future database development issues

There has been discussion on enabling an alternative RDBMS to sqlite - the requirements for any alternative are that it is free and, ideally, can be distributed by ccp4 or is easy for a user site to install. There is a comprehensive list of RDBMSs on Wikipedia. The advantage would be for a user site to have a central database for everyone's projects. One way to use this would be for the local sqlite db to remain the immediate db, but to update a central db regularly.

Some notes: