nicolas.peyrussie@laposte *nix forums beginner
Joined: 16 Mar 2005
Posts: 2
Posted: Thu Mar 17, 2005 7:19 am Post subject:
Re: performance in read access
I am going to try the second solution, using cursor access.
When I run "db_stat -m" I get an error message, even while my program
is running:
db_stat: DB_ENV->open: No such file or directory
I will work out later how to use this command.
Also, I am on Fedora Core 3.
Thank you for your help.
Regards,
Nicolas
Michael Cahill *nix forums Guru Wannabe
Joined: 26 May 2005
Posts: 219
Posted: Wed Mar 16, 2005 10:31 pm Post subject:
Re: performance in read access
Hi Nicolas,
What you're seeing here is performance that depends critically on the
cache. If most of the pages that your application needs are in cache,
it will run faster. If not, there will be more I/O, and your
application will run slower.
One complicating factor is that there are really two caches (at least):
the one maintained by Berkeley DB, and the filesystem buffer cache
maintained by the operating system. You didn't mention what operating
system you're using in your message, but the cache behavior varies
quite a bit between, say, Solaris and Windows.
The first thing you should know about is the cache statistics that
Berkeley DB gathers. If you run the command "db_stat -m" in the
environment directory, you will see the percentage of pages that are
found in cache. The higher this number, the better.
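To make that percentage concrete, here is a minimal sketch of the arithmetic in C. The function name cache_hit_percent is made up for this illustration; the two inputs are assumed to be the "found in cache" and "not found in cache" page counts, which as far as I know are also available programmatically as st_cache_hit and st_cache_miss via DB_ENV->memp_stat.

```c
#include <stdint.h>

/* Hypothetical helper: percentage of page requests that were
 * satisfied from the cache. Returns 0.0 when nothing has been
 * requested yet, to avoid dividing by zero. */
double cache_hit_percent(uint64_t hits, uint64_t misses)
{
    uint64_t total = hits + misses;

    if (total == 0)
        return 0.0;
    return 100.0 * (double)hits / (double)total;
}
```

A low value during the test phase would be consistent with the I/O wait described in the original question.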
For this to work, you'll need to open the database in an environment,
which will maintain the cache across accesses to the database. That in
itself should improve performance, as there is likely to be some
locality between runs of your code.
The simplest way to improve cache performance is usually to increase
the size of the cache. I don't understand why having 100 separate
cache regions would help, as you configured here:
database->set_cachesize(database,0,67108864,100);
Another way to improve cache performance is to order your operations.
For example, if your calculation is independent of the order of tokens
in the documents, try sorting them before reading from the database.
You could do this simply by first adding them to an in-memory database
(one with a NULL name), then scanning that database with a cursor to
look up the values from the "real" database.
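As a rough illustration of the same idea without a second database, the tokens of one page could be sorted in a plain array with qsort before the lookups. This swaps in a different technique from the in-memory database suggested above. It assumes the tokens are NUL-terminated strings and that the BTREE uses its default lexicographic key order; the function names are hypothetical. Dropping duplicates is a further assumption that only holds if each distinct token needs a single lookup.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of C strings. */
static int compare_tokens(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;

    return strcmp(*sa, *sb);
}

/* Sorts the token pointers in place and moves the distinct tokens
 * to the front of the array. Returns the number of distinct tokens.
 * Only pointers are moved; nothing is copied or freed, and the
 * entries past the returned count are leftovers to be ignored. */
size_t sort_unique_tokens(const char **tokens, size_t n)
{
    size_t i, out;

    if (n == 0)
        return 0;

    qsort(tokens, n, sizeof *tokens, compare_tokens);

    out = 1;
    for (i = 1; i < n; i++) {
        if (strcmp(tokens[i], tokens[out - 1]) != 0)
            tokens[out++] = tokens[i];
    }
    return out;
}
```

The first entries of the array, up to the returned count, can then be passed to DB->get one after another, so that consecutive lookups tend to land on the same or neighbouring pages.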
Regards,
Michael.
nicolas.peyrussie@laposte *nix forums beginner
Joined: 16 Mar 2005
Posts: 2
Posted: Wed Mar 16, 2005 1:57 pm Post subject:
performance in read access
Hello,
I am currently developing software in C that tokenizes HTML pages
(using flex) and then stores the resulting tokens in a Berkeley DB
database (a BTREE).
This can be understood as the learning phase.
Then, for the test phase, I tokenize an HTML page the same way, and I
have to retrieve each token from the database to give it a score and
then give the page a probability of being Bad or Good (porn or
non-porn, in fact).
The trouble is that, in the second phase, the program is really slow
to give me the final probability.
For the learning phase my best results were obtained with:
database->get_cachesize(database,&a1,&a2,&a3);
database->set_cachesize(database,0,a2*a2,1);
Thus, the program is really fast to tokenize and store tokens in the
database.
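One thing that may be worth checking in the set_cachesize call above: DB->get_cachesize returns the byte count through a u_int32_t, so if a2 is declared that way, a2*a2 is computed in 32-bit unsigned arithmetic and can wrap around. A minimal sketch of that arithmetic follows. The function names are made up, and the 262144-byte (256 KB) value used in the note after the code is only an assumption about what a2 might have held; the post does not say.

```c
#include <stdint.h>

/* The value a 32-bit unsigned a2*a2 would produce: only the low
 * 32 bits of the product survive. */
uint32_t square_u32(uint32_t bytes)
{
    return (uint32_t)((uint64_t)bytes * bytes);
}

/* The full mathematical product, for comparison. */
uint64_t square_u64(uint32_t bytes)
{
    return (uint64_t)bytes * bytes;
}
```

With 262144 as input, the full product is 68719476736 while the 32-bit result is 0, so printing the value actually passed to set_cachesize may be informative. Any byte count of 65536 or more wraps in the same way.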
But for the test phase, whatever the number of pages I use or the size
of the cache, I get the same slow result. My best results, though, are
obtained with:
database->get_cachesize(database,&a1,&a2,&a3);
database->set_cachesize(database,0,67108864,100);
To create the database, I tokenized around 4,000 HTML pages, and I have
1920282 entries in my database.
When I run "top" while my program is working on a directory (i.e.
getting the probabilities of all the files in a directory), I notice a
lot of I/O wait. Moreover, the percentage of CPU used is quite low.
One last point: if I stop the program and start it again, it is really
fast to give me results for the pages which were already scored before
I stopped.
Does that mean that only the tokens I get from the database (with
DB->get(...)) are stored in memory?
Does someone have a solution?
Thank you in advance.
Regards,
Nicolas