dbTalk Databases Forums  

Parallel access to the values of a table

comp.databases.mysql comp.databases.mysql


Discuss Parallel access to the values of a table in the comp.databases.mysql forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
gervaz
 
Posts: n/a

Default Parallel access to the values of a table - 08-22-2011 , 03:19 PM






Hi all,
I have a table containing a list of unique words, defined as:

CREATE TABLE word_list (
id int(10) unsigned NOT NULL auto_increment,
word VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
PRIMARY KEY (id),
UNIQUE KEY (word)
) ENGINE=InnoDB;

I have a lot of threads that concurrently try to get the id of a word
from the word_list table and if the word is not found, the word is
inserted and the id returned.

My concern is that I can incurr in integrity errors (unique key) since
I'm using concurrency.

Any idea on how to avoid such problems without removing concurrency?

No idea how to do it, but the solution could be using a lock and a
condition variable to tell every thread that the word's id is
available once one thread is writing that value, but it seems to me a
far complex solution...

Any help?

Thanks,

Mattia

Reply With Quote
  #2  
Old   
Gordon Burditt
 
Posts: n/a

Default Re: Parallel access to the values of a table - 08-22-2011 , 08:01 PM






Quote:
I have a table containing a list of unique words, defined as:

CREATE TABLE word_list (
id int(10) unsigned NOT NULL auto_increment,
word VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
PRIMARY KEY (id),
UNIQUE KEY (word)
) ENGINE=InnoDB;

I have a lot of threads that concurrently try to get the id of a word
from the word_list table and if the word is not found, the word is
inserted and the id returned.
You had better be using separate connections for each thread or you
will have a lot more problems than concurrent updates. Even if you
are doing read-only queries, using the same connection in multiple
threads can draw "out of sequence" errors when you make a second
query on the same connection before retrieving the result of the
first one.

Quote:
My concern is that I can incurr in integrity errors (unique key) since
I'm using concurrency.
The database is supposed to treat individual queries as atomic, and
the result is the same as if they were executed in some particular
order, not a mis-mash of two running at once. If at 12:01:00.000000,
1000 clients issue 1000 queries looking for the word "aint", don't
find it, and 1 millisecond later at 12:01:00.001000, 1000 clients
issue 1000 queries to insert the word "aint", exactly 999 of them
should fail with a duplicate key error and one should succeed.

Note that "a single SQL query is atomic" is *not* sufficient to
remove all concurrency problems. Some things you cannot do with
one query.

MySQL assumes you are always potentially using concurrency. At the
very least, independent processes on different hosts could be making
the queries. It also happens a lot that the queries are being made
locally from a busy web server (which is almost guaranteed to be
multi-threaded or multi-process) but the web hits are coming from
all over the Internet.

Quote:
Any idea on how to avoid such problems without removing concurrency?

No idea how to do it, but the solution could be using a lock and a
condition variable to tell every thread that the word's id is
available once one thread is writing that value, but it seems to me a
far complex solution...
If you can do the whole job in one statement, you shouldn't need
extra locking. Unless there's some clever trick with user-defined
variables, you need two queries, one to insert the word and one to
look up the id. Here you might run into trouble if there is another
client that deletes words or changes word ids.

The preferred method of doing extra locking is to put multiple
queries you want to run into one transaction (and you need to use
InnoDB tables). This permits queries well-known to banks: Transfer
$X from account A to account B (adjusting the balances of each
account), unless one or more of the accounts doesn't exist, or the
transfer would result in an overdraft of account A, in which case
nothing should change (This is in a simplified world where there are
no overdraft fees). If any check fails, roll back the transaction.
It is absolutely unacceptable to "lose" or "create" money by having
half a transaction executed and the other half not.

Reply With Quote
  #3  
Old   
Álvaro G. Vicario
 
Posts: n/a

Default Re: Parallel access to the values of a table - 08-23-2011 , 02:44 AM



El 22/08/2011 22:19, gervaz escribió/wrote:
Quote:
I have a table containing a list of unique words, defined as:

CREATE TABLE word_list (
id int(10) unsigned NOT NULL auto_increment,
word VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
PRIMARY KEY (id),
UNIQUE KEY (word)
) ENGINE=InnoDB;

I have a lot of threads that concurrently try to get the id of a word
from the word_list table and if the word is not found, the word is
inserted and the id returned.

My concern is that I can incurr in integrity errors (unique key) since
I'm using concurrency.

Any idea on how to avoid such problems without removing concurrency?

No idea how to do it, but the solution could be using a lock and a
condition variable to tell every thread that the word's id is
available once one thread is writing that value, but it seems to me a
far complex solution...
Duplicates trigger a 1062 error code:

http://dev.mysql.com/doc/refman/5.5/...r_er_dup_entry

Some people suggest that it's better make a SELECT query prior to
insertion (which possibly requires a table lock to work reliably) but I
can't see anything wrong in adding logic to capture this specific error.


(Can your threads remove data as well?)


--
-- http://alvaro.es - Álvaro G. Vicario - Burgos, Spain
-- Mi sitio sobre programación web: http://borrame.com
-- Mi web de humor satinado: http://www.demogracia.com
--

Reply With Quote
  #4  
Old   
gervaz
 
Posts: n/a

Default Re: Parallel access to the values of a table - 08-23-2011 , 03:25 AM



On 23 Ago, 09:44, "Álvaro G. Vicario"
<alvaro.NOSPAMTH... (AT) demogracia (DOT) com.invalid> wrote:
Quote:
El 22/08/2011 22:19, gervaz escribi /wrote:





I have a table containing a list of unique words, defined as:

CREATE TABLE word_list (
* * *id int(10) unsigned NOT NULL auto_increment,
* * *word VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
* * *PRIMARY KEY (id),
* * *UNIQUE KEY (word)
) ENGINE=InnoDB;

I have a lot of threads that concurrently try to get the id of a word
from the word_list table and if the word is not found, the word is
inserted and the id returned.

My concern is that I can incurr in integrity errors (unique key) since
I'm using concurrency.

Any idea on how to avoid such problems without removing concurrency?

No idea how to do it, but the solution could be using a lock and a
condition variable to tell every thread that the word's id is
available once one thread is writing that value, but it seems to me a
far complex solution...

Duplicates trigger a 1062 error code:

http://dev.mysql.com/doc/refman/5.5/...erver.html#err...

Some people suggest that it's better make a SELECT query prior to
insertion (which possibly requires a table lock to work reliably) but I
can't see anything wrong in adding logic to capture this specific error.

(Can your threads remove data as well?)

--
--http://alvaro.es- lvaro G. Vicario - Burgos, Spain
-- Mi sitio sobre programaci n web:http://borrame.com
-- Mi web de humor satinado:http://www.demogracia.com
--- Nascondi testo citato

- Mostra testo citato -
Some clarifications.

Every thread uses a separate database connection.

When looking for a word id I execute """SELECT id FROM word_list WHERE
word = '{}';"""
If the word is not found, then I use """INSERT INTO word_list (word)
VALUES ('{}');"""
No word deletion nor update is involved.

My concern is regarding the example proposed by Gordon, where I can
have duplicate key errors.

In that case should I raise an exception and execute the SELECT again?
Consider that the whole code is inserted in a python program.

Mattia

Reply With Quote
  #5  
Old   
The Natural Philosopher
 
Posts: n/a

Default Re: Parallel access to the values of a table - 08-23-2011 , 03:36 AM



Gordon Burditt wrote:
If at 12:01:00.000000,
Quote:
1000 clients issue 1000 queries looking for the word "aint", don't
find it, and 1 millisecond later at 12:01:00.001000, 1000 clients
issue 1000 queries to insert the word "aint", exactly 999 of them
should fail with a duplicate key error and one should succeed.
I have exactly this problem. Or nearly so.

I am counting hits from incoming IP addresses.

The logic is

Get the record for this IP address.

If successful, increment the hits counter, record time

ELSE create it with counter as one (and some other stuff)

Sometimes I get duplicate records..

It's not a real world problem (records are for human analysis only) ,
but I'd be curious to know a suitable strategy to avoid it.

Reply With Quote
  #6  
Old   
Álvaro G. Vicario
 
Posts: n/a

Default Re: Parallel access to the values of a table - 08-23-2011 , 04:18 AM



El 23/08/2011 10:25, gervaz escribió/wrote:
Quote:
Every thread uses a separate database connection.

When looking for a word id I execute """SELECT id FROM word_list WHERE
word = '{}';"""
If the word is not found, then I use """INSERT INTO word_list (word)
VALUES ('{}');"""
No word deletion nor update is involved.

My concern is regarding the example proposed by Gordon, where I can
have duplicate key errors.
Not sure of your exact concerns... What Gordon says applies to this.

Quote:
In that case should I raise an exception and execute the SELECT again?
If you write your query in such a way that the only duplicate can be in
the `word` column, you only need to reissue the SELECT query if you need
the value of `id` (the mere fact of getting 1062 means that the word
exists). Assuming no deletions simplifies everything.



--
-- http://alvaro.es - Álvaro G. Vicario - Burgos, Spain
-- Mi sitio sobre programación web: http://borrame.com
-- Mi web de humor satinado: http://www.demogracia.com
--

Reply With Quote
  #7  
Old   
Gordon Burditt
 
Posts: n/a

Default Re: Parallel access to the values of a table - 08-23-2011 , 05:11 AM



Quote:
Every thread uses a separate database connection.

When looking for a word id I execute """SELECT id FROM word_list WHERE
word = '{}';"""
If the word is not found, then I use """INSERT INTO word_list (word)
VALUES ('{}');"""
No word deletion nor update is involved.

My concern is regarding the example proposed by Gordon, where I can
have duplicate key errors.
Your original scenario could have duplicate key errors. I just
proposed an example where it was blatantly obvious that you could
have a lot of them, rather than an occasional one where it would
be hard to reproduce and you are tearing your hair out trying to
figure out how it could happen.

Quote:
In that case should I raise an exception and execute the SELECT again?
Consider that the whole code is inserted in a python program.
Your code can do something like this, but for some error-retentive[*] programmers it's almost impossible to get them to *IGNORE* an
error, which they seem to have confused with a mortal sin.

1. SELECT as above. If the word is found, return the id.
If not, continue to step 2. Any other error is a problem.
2. INSERT as above. If the query works or gets a duplicate key
error, go to step 3. Any other error is a problem.
3. SELECT as above (again). If the word is found, return the id.
If the word is NOT found, either someone deleted it, or there's
something very wrong. In step 2, you either inserted the word
or someone else did and it was there because you got a duplicate
key error. Any other error is also a problem.
4. There should be no need to try repeating the INSERT or SELECT
queries again for the errors discussed. If your network
is unreliable, you can retry for those errors.

Especially in step 2, I see little use for "throwing" here, unless
you can figure out how to make the query "throw" on any error EXCEPT
a duplicate key error. You need to check the result of the query,
and do something different based on the results. "throwing" and
"catching" mostly gets in the way here. Just check the results of
queries. (There are surprisingly many places where the *SUCCESS*
of a lower-level function indicates the *FAILURE* of a higher-level
function (Output file already exists, overwrite?) or vice versa.
Looking up a username in the database and finding it may be an error
("that name is already taken. Please choose another one.")

The "problem" errors could be as simple as typos in the queries,
network connectivity to the server was lost, or as bad as disk
errors in the database or corrupted tables.

[*] Error-retentive programmers are ones who will insist that all
errors must be reported, and to heck with the original function of
the program. If writing on the error log failed, say, because the
disk is full, you must ensure that the disk stays full by reporting
THAT error on the same error log, on to infinity, so nobody will
be able to clear space fast enough. I'm not saying you shouldn't
log errors, but sooner or later if things keep failing, you need
to give up.

Reply With Quote
  #8  
Old   
Jerry Stuckle
 
Posts: n/a

Default Re: Parallel access to the values of a table - 08-23-2011 , 05:19 AM



On 8/23/2011 4:36 AM, The Natural Philosopher wrote:
Quote:
Gordon Burditt wrote:
If at 12:01:00.000000,
1000 clients issue 1000 queries looking for the word "aint", don't
find it, and 1 millisecond later at 12:01:00.001000, 1000 clients
issue 1000 queries to insert the word "aint", exactly 999 of them
should fail with a duplicate key error and one should succeed.

I have exactly this problem. Or nearly so.

I am counting hits from incoming IP addresses.

The logic is

Get the record for this IP address.

If successful, increment the hits counter, record time

ELSE create it with counter as one (and some other stuff)

Sometimes I get duplicate records..

It's not a real world problem (records are for human analysis only) ,
but I'd be curious to know a suitable strategy to avoid it.

Hire someone who knows what they are doing.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex (AT) attglobal (DOT) net
==================

Reply With Quote
  #9  
Old   
Gordon Burditt
 
Posts: n/a

Default Re: Parallel access to the values of a table - 08-23-2011 , 05:32 AM



Quote:
If at 12:01:00.000000,
1000 clients issue 1000 queries looking for the word "aint", don't
find it, and 1 millisecond later at 12:01:00.001000, 1000 clients
issue 1000 queries to insert the word "aint", exactly 999 of them
should fail with a duplicate key error and one should succeed.

I have exactly this problem. Or nearly so.

I am counting hits from incoming IP addresses.
They invented an extended INSERT statement just for your problem.
It works a bit backwards from your scenario: try an insert. If
it fails because the record already exists, do an update on the
existing record. But it does what you want: update the record,
creating it if necessary. And since it's done in one query,
no transactions or extra locking required.

Quote:
The logic is

Get the record for this IP address.

If successful, increment the hits counter, record time

ELSE create it with counter as one (and some other stuff)

Sometimes I get duplicate records..

It's not a real world problem (records are for human analysis only) ,
but I'd be curious to know a suitable strategy to avoid it.
Assuming you have 4 fields in the table hitlog:
ip_address (varchar), count (int), first_contact (datetime)
and last_contact (datetime).
There is a unique index (or primary key) on ip_address.

INSERT INTO hitlog(ip_address, count, first_contact, last_contact)
VALUES ('$ip', 1, now(), now())
ON DUPLICATE KEY
UPDATE count = count+1, last_contact = now();

Here, $ip is replaced with an appropriate value, say, by PHP.
Note that it's ONE query.

Reply With Quote
  #10  
Old   
The Natural Philosopher
 
Posts: n/a

Default Re: Parallel access to the values of a table - 08-23-2011 , 09:52 AM



Gordon Burditt wrote:
Quote:
If at 12:01:00.000000,
1000 clients issue 1000 queries looking for the word "aint", don't
find it, and 1 millisecond later at 12:01:00.001000, 1000 clients
issue 1000 queries to insert the word "aint", exactly 999 of them
should fail with a duplicate key error and one should succeed.
I have exactly this problem. Or nearly so.

I am counting hits from incoming IP addresses.

They invented an extended INSERT statement just for your problem.
It works a bit backwards from your scenario: try an insert. If
it fails because the record already exists, do an update on the
existing record. But it does what you want: update the record,
creating it if necessary. And since it's done in one query,
no transactions or extra locking required.

The logic is

Get the record for this IP address.

If successful, increment the hits counter, record time

ELSE create it with counter as one (and some other stuff)

Sometimes I get duplicate records..

It's not a real world problem (records are for human analysis only) ,
but I'd be curious to know a suitable strategy to avoid it.

Assuming you have 4 fields in the table hitlog:
ip_address (varchar), count (int), first_contact (datetime)
and last_contact (datetime).
There is a unique index (or primary key) on ip_address.

INSERT INTO hitlog(ip_address, count, first_contact, last_contact)
VALUES ('$ip', 1, now(), now())
ON DUPLICATE KEY
UPDATE count = count+1, last_contact = now();

Here, $ip is replaced with an appropriate value, say, by PHP.
Note that it's ONE query.

Sadly there is a separate key..

+-------------+------------------+------+-----+---------+----------------+
Quote:
Field | Type | Null | Key | Default | Extra |
+-------------+------------------+------+-----+---------+----------------+
id | int(10) unsigned | NO | PRI | NULL | auto_increment |
ipaddr | tinytext | YES | | NULL | |
hits | int(10) unsigned | YES | | NULL | |
whois | varchar(1024) | YES | | NULL | |
last_access | datetime | YES | | NULL | |
+-------------+------------------+------+-----+---------+----------------+
5 rows in set (0.07 sec)

That probably scuppers that approach unless I redesign the table. :-(

So ON DUPLICATE(ipaddr) is not allowed huh? :-)

I hate table locks...

Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.