QM Distributed Files and maximum limits using GUID
05-03-2010, 02:31 PM
All,
I am now using QM's Distributed Files for the storage of session
records in Pavuk. This has been an interesting exercise since Martin
introduced the technology to me at Spectrum.
In the Pavuk web server, every successful login session is identified
by a Globally Unique Identifier, or GUID. Please consult Wikipedia if
you are unfamiliar with these and want a complete discussion. A
utility such as UUIDGEN returns a GUID, which is a 36-character string
(32 hex digits plus 4 hyphens) such as:
30C7F94E-2ACC-40A9-B20A-C1DEEF14C8CB. The purpose of GUIDs is to
ensure that no system ever has a repeating key - EVER.
This makes GUIDs very handy for things like logfiles.
Because the session IDs are stored perpetually (the records are
small), an efficient storage mechanism had to be devised for a
potential client who is looking to license 5000 users of Pavuk. I
chose to tackle this problem and learn about the QM distributed files.
A distributed file (DF) is a collection of regular hash files that all
have the same structure. A DF has 1 or more "part files", each of
which is a real hash file to which the DF refers. For my example, I'll
use P.SESSIONS.DFL as my Distributed File and P.SESSIONS.PART as my
actual data storage area.
A DF uses each record's primary key to decide which part file the
record belongs in. In my example, the GUIDs all begin with a hex
string and they are fairly evenly distributed across the 00-FF space
in the first byte. So, let's take the first byte and say we're going
to create 256 physical files. Of course, the files could be anywhere.
I chose to write a program to create the multi-file P.SESSIONS.PART,00
through P.SESSIONS.PART,FF. This provides the storage area.
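The claim that the first byte spreads evenly across 00-FF can be sanity-checked with a small sketch. This one is in Python rather than QM BASIC, and it assumes random (version 4) GUIDs; the function name and the seed are only illustrative.

```python
import random
import uuid
from collections import Counter

def first_byte_histogram(n, seed=0):
    """Count how many of n random version-4 GUIDs share each 2-character hex prefix."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    counts = Counter()
    for _ in range(n):
        # Build a version-4 GUID from 128 random bits (assumption: GUIDs are random)
        guid = str(uuid.UUID(int=rng.getrandbits(128), version=4)).upper()
        counts[guid[:2]] += 1
    return counts

hist = first_byte_histogram(100_000)
# Number of distinct prefixes seen, plus the smallest and largest bucket
print(len(hist), min(hist.values()), max(hist.values()))
```

If the buckets come out close in size, one part file per prefix should fill at roughly the same rate.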
Next, we need to create an I-TYPE dictionary in P.SESSIONS.PART that
will return a ***NUMERIC PART NUMBER*** based upon our record IDs.
We'll call this dictionary PARTNUMBER.
0001 I
0002 OCONV(@ID[1,2],"MCX")
0005 3R
This dictionary takes the first 2 hex characters of the record ID
and returns a number from 0 - 255: 00=0, 1B=27, FF=255.
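For readers who don't have QM at hand, the same mapping can be sketched in Python. This is only an illustration of what the I-type computes, not QM code; part_number and part_file are names made up for the example.

```python
def part_number(record_id):
    """Return 0..255 from the first two hex characters of a GUID-style record ID."""
    return int(record_id[:2], 16)

def part_file(record_id, base="P.SESSIONS.PART"):
    """Return the multi-file name that would hold this record ID."""
    return f"{base},{record_id[:2].upper()}"

guid = "30C7F94E-2ACC-40A9-B20A-C1DEEF14C8CB"
print(part_number(guid), part_file(guid))
```

The sample GUID above starts with 30, so it maps to part number 48 and lands in P.SESSIONS.PART,30.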
This handy dictionary is used by the Distributed File processor to
find which file should store our data. Now, we need to tell QM to build
the Distributed File itself:
ADD.DF P.SESSIONS.DFL P.SESSIONS.PART,00 0 PARTNUMBER
The first time we run this command, the P.SESSIONS.DFL control
component will be created. We are assigning P.SESSIONS.PART,00 (the
IDs whose hex prefix is 00) to be component 0. We're also telling QM
that the part number is computed by the dictionary item
"PARTNUMBER".
Next we do:
ADD.DF P.SESSIONS.DFL P.SESSIONS.PART,01 1
Note, we don't have to tell QM the dictionary every time.
We're done at:
ADD.DF P.SESSIONS.DFL P.SESSIONS.PART,FF 255
Of course, I wrote a BASIC program to build this in a loop!
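My loop was written in QM BASIC and isn't reproduced here. As a rough stand-in, this Python sketch only generates the 256 command lines as text, in the same form as the ADD.DF commands above, so they could be pasted or scripted. The function name is hypothetical.

```python
def add_df_commands(df="P.SESSIONS.DFL", part="P.SESSIONS.PART",
                    dict_item="PARTNUMBER"):
    """Build the ADD.DF command text for parts 00..FF (part numbers 0..255)."""
    cmds = []
    for n in range(256):
        cmd = f"ADD.DF {df} {part},{n:02X} {n}"
        if n == 0:
            # Only the first command names the dictionary item
            cmd += f" {dict_item}"
        cmds.append(cmd)
    return cmds

cmds = add_df_commands()
print(cmds[0])
print(cmds[1])
print(cmds[-1])
```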
I now have a file called P.SESSIONS.DFL that I can OPEN and use, and it
is composed of 256 physical hash files. Because each QM hash file may
be up to 16,384GB, this creates a file that is theoretically capable
of holding 4,194,304GB, or about 4.19 Petabytes of data. That's
4,194,304,000,000,000 bytes of data, counting a GB as 10^9 bytes.
More than enough for my purposes!
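The capacity arithmetic is easy to check. This sketch takes the 16,384GB per-file limit quoted above as given and works the total out both ways, since a GB may be counted as 10^9 or 2^30 bytes. 256 parts at 16,384GB each come to 4,194,304GB, which is in the petabyte range.

```python
PARTS = 256
MAX_PART_GB = 16_384  # per-file limit as quoted in the post

total_gb = PARTS * MAX_PART_GB           # 4,194,304 GB
total_bytes_decimal = total_gb * 10**9   # GB counted as 10^9 bytes
total_bytes_binary = total_gb * 2**30    # GB counted as 2^30 bytes

print(f"{total_gb:,} GB")
print(f"{total_bytes_decimal:,} bytes = {total_bytes_decimal / 10**15} PB")
print(f"{total_bytes_binary:,} bytes = {total_bytes_binary / 2**50} PiB")
```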
If I actually LIST P.SESSIONS.PART,00, I see the session records that
begin with 00. LIST P.SESSIONS.PART,3A shows those beginning with 3A,
etc.
This is a massive, yet simple way to deploy DFs. There is an easy 1-1
correlation between the first byte of the primary key and the storage
location. Remember, the actual index to find the storage location is
an integer beginning with 0.
***
Performance note: If you are going to use massive files like this, I
strongly recommend that the NUMFILES setting in your qmconfig file be
larger than the number of parts of the DF. I set my NUMFILES to 400
and the performance is excellent. This has to do with the number of
physical file handles present.
For Mac Server users, I also recommend that you run
$ sudo launchctl limit maxfiles 4096
after each system restart. OS X 10.6 (Snow Leopard) has the per-process
file handle limit set absurdly low.
***
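To see whether the operating system's per-process limit leaves room for all the parts, a quick check like the following Python sketch may help on Unix-like systems. It reports the limit of the process that runs it, not QM's NUMFILES setting, and other_files is just a guessed allowance for everything else a process keeps open.

```python
import resource

def handle_headroom(parts, other_files=50):
    """Compare the soft open-file limit with the handles a DF of `parts` parts may need."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    needed = parts + other_files  # other_files is an assumed allowance
    unlimited = soft == resource.RLIM_INFINITY
    return {"soft": soft, "hard": hard, "needed": needed,
            "ok": unlimited or soft >= needed}

print(handle_headroom(256))
```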
The way that you distribute your files is a function of your
imagination and the structure of the primary keys.
In Pavuk 2.0, the archival files will have the option to switch from a
normal hash file to a DF as needed.
Bill Crowell