![]() | |
#1
| |||
| |||
|
|
ZURR_LISMO Transcriptional regulator ZurR - Listeria monocytogenes MGLTATEALMKMKEKGYKHTDKREFLINLLARKNKYLTAKDVLENMKDDF PGISFDTIYR |
#2
| |||
| |||
|
|
I know that this the most general group on databases, and I did not read the FAQ (the reason being that I did not find :-) ). I am not a computer scientist, and I need to implement a simple database. My problem is from the general area of the "which db to choose" problems -- so if you mind this type of questions here, please tell me to go away to another newsgroup. I have a flat text file with FASTA-like sequences. FASTA format, in biology, is a mega-super-simple format where each record starts with a line starting with a ">" and identifier(s), e.g.: ZURR_LISMO Transcriptional regulator ZurR - Listeria monocytogenes MGLTATEALMKMKEKGYKHTDKREFLINLLARKNKYLTAKDVLENMKDDF PGISFDTIYR NLSLFVELGIFEETDLSGERNFRLACTHEHHHHHFICMKCGKTKEIMMCP MDFLTEALPG YQIDGHKFEVYGECPECLQAS The flat text file can have several million records like that, the records can have up to few thousand characters, the whole flat text file can have several gigs. My aim is to be able to access these records as fast as possible on a Unix workstation from local hard disk. Furthermore, I need to access the records from Perl. I do not need any complex queries; essentially, I need to access records by their canonical ID ("accession number"). At first, for simplicity reasons, I was using a Berkeley-style db (NDBM to be specific), which can be tied to a Perl hash (this makes it super easy to use). The DB was populated with numbers, and each number was the position (within the file) of the given record, as used by seek and tell. To read the records, I was opening the file, tying the hash, getting the position, seeking to that position and reading the record until the next ">". The reasoning behind that was: flat files are still needed for other purposes, why should I duplicate the information in the db? seeking on a file should be very fast. This approach, however, turned out to be problematic, as the NDBM files are platform specific. It was also not very fast compared to my next solution, which is SQLite. I did choose SQLite and not MySQL, because for each of the biological databases I have a neatly formatted SQLite file that I can carry to other computers easily and that can be easily renamed / versioned etc. (w/o learning too much about MySQL or PostgresSQL). Again, the procedure is similar: first, I access the DB to get the position within the flat text file, then I open the flat text file and read the record at this position. This is quite fast... unless you need to do several thousand queries as-fast-as-possible. I did some benchmarking; it shows that the fist step (accessing the sqlite db) takes roughly 1/3 of the total time, and getting the records the remaining 2/3. For larger databases, the proportion shifts towards the former (that is, SQL lookup takes longer relative to the seek() on the flat text file). I am now trying to reformat the databases such the SQLite files contain the records directly (this seems to be faster, but I get huge SQLite files). I have also tried MySQL, but this was not better than SQLite on my system and with my setup (maybe I should try it again?). I know that without many more details (which I will be happy to provide) it is hard to answer my questions, but I know so little about databases that I do not really know what information to add and where to ask. My main question is: what can I / should I do? What other optimisations can be done? Maybe the whole approach is flawed? Any ideas, suggestions, comments? j. |
![]() |
| Thread Tools | Search this Thread |
| Display Modes | |
| |