dbTalk Databases Forums  

Newbie question

comp.databases comp.databases


Discuss Newbie question in the comp.databases forum.



Reply
 
Thread Tools Search this Thread Display Modes
  #1  
Old   
January Weiner
 
Posts: n/a

Default Newbie question - 11-23-2007 , 12:10 PM






Dear all,

I know that this the most general group on databases, and I did not read
the FAQ (the reason being that I did not find :-) ).

I am not a computer scientist, and I need to implement a simple database.
My problem is from the general area of the "which db to choose" problems --
so if you mind this type of questions here, please tell me to go away to
another newsgroup.

I have a flat text file with FASTA-like sequences. FASTA format, in
biology, is a mega-super-simple format where each record starts with a line
starting with a ">" and identifier(s), e.g.:

Quote:
ZURR_LISMO Transcriptional regulator ZurR - Listeria monocytogenes
MGLTATEALMKMKEKGYKHTDKREFLINLLARKNKYLTAKDVLENMKDDF PGISFDTIYR
NLSLFVELGIFEETDLSGERNFRLACTHEHHHHHFICMKCGKTKEIMMCP MDFLTEALPG
YQIDGHKFEVYGECPECLQAS

The flat text file can have several milions records like that, the records
can have up to few thousand characters, the whole flat text file can have
several gigs.

My aim is to be able to access these records as fast as possible on a Unix
workstation from local hard disk. Furthermore, I need to access the
records from Perl. I do not need any complex queries; essentially, I need to
access records by their canonical ID ("accession number").

At first, for simplicity reasons, I was using a Berkeley-style db (NDBM to
be specific), which can be tied to a Perl hash (this makes it super easy to
use). The DB was populated with numbers, and each number was the position
(within the file) of the given record, as used by seek and tell. To read
the records, I was opening the file, tying the hash, getting the position,
seeking to that position and reading the record until the next ">". The
reasoning behind that was: flat files are still needed for other purposes,
why should I duplicate the information in the db? seeking on a file should
be very fast.

This approach, however, turned out to be problematic, as the NDBM files are
platform specific. It was also not very fast compared to my next solution,
which is SQLite. I did choose SQLite and not MySQL, because for each of
the biological databases I have a neatly formatted SQLite file that I can
carry to other computers easily and that can be easily renamed / versioned
etc. (w/o learning too much about MySQL or PostgresSQL).

Again, the procedure is similar: first, I access the DB to get the position
within the flat text file, then I open the flat text file and read the
record at this position.

This is quite fast... unless you need to do several thousand queries
as-fast-as-possible. I did some benchmarking; it shows that the fist step
(accessing the sqlite db) takes roughly 1/3 of the total time, and getting
the records the remaining 2/3. For larger databases, the proportion shifts
towards the former (that is, SQL lookup takes longer relative to the seek()
on the flat text file).

I am now trying to reformat the databases such the SQLite files contain the
records directly (this seems to be faster, but I get huge SQLite files). I
have also tried MySQL, but this was not better than SQLite on my system and
with my setup (maybe I should try it again?).

I know that without many more details (which I will be happy to provide) it
is hard to answer my questions, but I know so little about databases that I
do not really know what information to add and where to ask.

My main question is: what can I / should I do? What other optimisations
can be done? Maybe the whole approach is flawed? Any ideas, suggestions,
comments?

j.

--
---------Dr. January Weiner 3 -------------------------------------
Evolutionary Bioinformatics, University of Muenster
http://www.uni-muenster.de/Evolution/ebb/


Reply With Quote
  #2  
Old   
CRPence
 
Posts: n/a

Default Re: Newbie question - 11-28-2007 , 09:16 PM






I am not familiar with SQLlite, but there is a CREATE [UNIQUE] INDEX,
for which access by a specific key should allow fast access; i.e. such
that the access times should not go up significantly as the number of
rows increase. Was an INDEX created for the database, such that a
unique "accession number" was being used to select the row using?:
WHERE "accession_number" = the_unique_value
http://www.sqlite.org/lang_createindex.html

The doc does suggest [for /very large datasets/, but it seems
comments are for insert activity only] that a larger page size for large
databases, and that several GB databases may not be an appropriate use
for that database; i.e. when combining all of the data into the table.
http://www.sqlite.org/whentouse.html

Regards, Chuck
--
All comments provided "as is" with no warranties of any kind
whatsoever and may not represent positions, strategies, nor views of my
employer

January Weiner wrote:
Quote:
I know that this the most general group on databases, and I did not read
the FAQ (the reason being that I did not find :-) ).

I am not a computer scientist, and I need to implement a simple database.
My problem is from the general area of the "which db to choose" problems --
so if you mind this type of questions here, please tell me to go away to
another newsgroup.

I have a flat text file with FASTA-like sequences. FASTA format, in
biology, is a mega-super-simple format where each record starts with a line
starting with a ">" and identifier(s), e.g.:

ZURR_LISMO Transcriptional regulator ZurR - Listeria monocytogenes
MGLTATEALMKMKEKGYKHTDKREFLINLLARKNKYLTAKDVLENMKDDF PGISFDTIYR
NLSLFVELGIFEETDLSGERNFRLACTHEHHHHHFICMKCGKTKEIMMCP MDFLTEALPG
YQIDGHKFEVYGECPECLQAS

The flat text file can have several million records like that, the records
can have up to few thousand characters, the whole flat text file can have
several gigs.

My aim is to be able to access these records as fast as possible on a Unix
workstation from local hard disk. Furthermore, I need to access the
records from Perl. I do not need any complex queries; essentially, I need to
access records by their canonical ID ("accession number").

At first, for simplicity reasons, I was using a Berkeley-style db (NDBM to
be specific), which can be tied to a Perl hash (this makes it super easy to
use). The DB was populated with numbers, and each number was the position
(within the file) of the given record, as used by seek and tell. To read
the records, I was opening the file, tying the hash, getting the position,
seeking to that position and reading the record until the next ">". The
reasoning behind that was: flat files are still needed for other purposes,
why should I duplicate the information in the db? seeking on a file should
be very fast.

This approach, however, turned out to be problematic, as the NDBM files are
platform specific. It was also not very fast compared to my next solution,
which is SQLite. I did choose SQLite and not MySQL, because for each of
the biological databases I have a neatly formatted SQLite file that I can
carry to other computers easily and that can be easily renamed / versioned
etc. (w/o learning too much about MySQL or PostgresSQL).

Again, the procedure is similar: first, I access the DB to get the position
within the flat text file, then I open the flat text file and read the
record at this position.

This is quite fast... unless you need to do several thousand queries
as-fast-as-possible. I did some benchmarking; it shows that the fist step
(accessing the sqlite db) takes roughly 1/3 of the total time, and getting
the records the remaining 2/3. For larger databases, the proportion shifts
towards the former (that is, SQL lookup takes longer relative to the seek()
on the flat text file).

I am now trying to reformat the databases such the SQLite files contain the
records directly (this seems to be faster, but I get huge SQLite files). I
have also tried MySQL, but this was not better than SQLite on my system and
with my setup (maybe I should try it again?).

I know that without many more details (which I will be happy to provide) it
is hard to answer my questions, but I know so little about databases that I
do not really know what information to add and where to ask.

My main question is: what can I / should I do? What other optimisations
can be done? Maybe the whole approach is flawed? Any ideas, suggestions,
comments?

j.


Reply With Quote
Reply




Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.