niXforums Forum Index
performance in read access
nicolas.peyrussie@laposte
Posted: Thu Mar 17, 2005 7:19 am    Post subject: Re: performance in read access

I am going to try the second solution, the one with cursor access.
When I run "db_stat -m" I get an error message, even while my program is
running:
db_stat: DB_ENV->open: No such file or directory
I will look into how to use this command later.

Also, I am on Fedora Core 3.

Thank you for your help.
Regards,
Nicolas
Michael Cahill
Posted: Wed Mar 16, 2005 10:31 pm    Post subject: Re: performance in read access

Hi Nicolas,

What you're seeing here is performance that depends critically on the
cache. If most of the pages that your application needs are in cache,
it will run faster. If not, there will be more I/O, and your
application will run slower.

One complicating factor is that there are really two caches (at least):
the one maintained by Berkeley DB, and the filesystem buffer cache
maintained by the operating system. You didn't mention what operating
system you're using in your message, but the cache behavior varies
quite a bit between, say, Solaris and Windows.

The first thing you should know about is the cache statistics that
Berkeley DB gathers. If you run the command "db_stat -m" in the
environment directory, you will see the percentage of pages that are
found in cache. The higher this number, the better.

For this to work, you'll need to open the database in an environment,
which will maintain the cache across accesses to the database. That in
itself should improve performance, as there is likely to be some
locality between runs of your code.

The simplest way to improve cache performance is usually to increase
the size of the cache. I don't understand why having 100 separate
cache regions would help, as you configured here:

database->set_cachesize(database,0,67108864,100);
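
To make the numbers in that call concrete: as I understand the documentation, the total cache is gbytes gigabytes plus bytes, and a value of ncache greater than 1 splits that total into ncache regions of equal size. The small sketch below just does that arithmetic; the helper names are invented for illustration and are not part of Berkeley DB.

```c
#include <stdint.h>

/* Total cache size implied by the (gbytes, bytes) pair. */
uint64_t total_cache_bytes(uint32_t gbytes, uint32_t bytes)
{
    return (uint64_t)gbytes * 1024u * 1024u * 1024u + bytes;
}

/* Approximate size of each region when the cache is split ncache ways.
 * Values of ncache below 1 are treated as a single region. */
uint64_t bytes_per_region(uint32_t gbytes, uint32_t bytes, int ncache)
{
    if (ncache < 1)
        ncache = 1;
    return total_cache_bytes(gbytes, bytes) / (uint64_t)ncache;
}
```

For the call above, the total is 67108864 bytes (64 MB), and splitting it 100 ways leaves each region with about 671088 bytes, roughly 655 KB.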

Another way to improve cache performance is to order your operations.
For example, if your calculation is independent of the order of tokens
in the documents, try sorting them before reading from the database.
You could do this simply by first adding them to an in-memory database
(one with a NULL name), then scanning that database with a cursor to
look up the values from the "real" database.
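
Here is a sketch of the same ordering idea using a different technique: plain qsort on an array of token strings instead of the in-memory database. It assumes the tokens are NUL-terminated strings and that the "real" database uses the default B-tree comparison (byte-wise, lexicographic), so that strcmp order is close to key order. The function name sort_unique_tokens is made up for this example.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of C strings. */
static int cmp_tokens(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcmp(*sa, *sb);
}

/* Sort tokens in place and drop duplicates, so each distinct token is
 * looked up once and lookups walk the B-tree in key order.
 * Returns the number of distinct tokens left at the front of the array. */
size_t sort_unique_tokens(const char **tokens, size_t n)
{
    size_t i, out;

    if (n == 0)
        return 0;
    qsort(tokens, n, sizeof tokens[0], cmp_tokens);
    out = 1;
    for (i = 1; i < n; i++)
        if (strcmp(tokens[i], tokens[out - 1]) != 0)
            tokens[out++] = tokens[i];
    return out;
}
```

After this, you would call DB->get once for each of the first `out` entries, in order. If the score depends on how often a token occurs in the page, you would need to keep a count per token rather than simply dropping the duplicates.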

Regards,
Michael.
nicolas.peyrussie@laposte
Posted: Wed Mar 16, 2005 1:57 pm    Post subject: performance in read access

Hello,

I am currently developing software in C that tokenizes HTML pages (using
flex) and stores the resulting tokens in a Berkeley DB database (a
BTREE).
This can be understood as the learning phase.

Then, for the test phase, I tokenize an HTML page the same way, and for
each token I retrieve its entry from the database to give it a score.
From the scores I compute the probability that the page is Bad or Good
(porn or non-porn, in fact).

The trouble is that, in the second phase, the program is really slow
to give me the final probability.

For the learning phase, my best results were obtained with:
database->get_cachesize(database,&a1,&a2,&a3);
database->set_cachesize(database,0,a2*a2,1);
With this, the program tokenizes and stores tokens in the database very
quickly.
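
One thing worth checking in the snippet above: a2*a2 squares a byte count. This is only a sketch under two assumptions, that a2 is a 32-bit unsigned value (as the u_int32_t pointer argument of get_cachesize suggests) and that it holds the documented default of 256 KB. Under those assumptions the multiplication wraps around, so the value passed to set_cachesize may be very different from what was intended. The helper name is invented for illustration.

```c
#include <stdint.h>

/* Square a byte count in 32-bit unsigned arithmetic, the way a2*a2
 * would be evaluated if a2 is a 32-bit unsigned integer.
 * Unsigned overflow is well defined in C: the result wraps modulo 2^32. */
uint32_t squared_cache_bytes(uint32_t bytes)
{
    return bytes * bytes;
}
```

For example, 262144 * 262144 is 2^36, which wraps to 0 in 32 bits, whereas 8192 * 8192 fits and gives 67108864 (64 MB). Printing the value before calling set_cachesize would show what is really being requested.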

But for the test phase, whatever the number of pages or the size of the
cache, I get the same slow result. My best results, though, are
obtained with:
database->get_cachesize(database,&a1,&a2,&a3);
database->set_cachesize(database,0,67108864,100);

To create the database, I tokenized around 4,000 HTML pages, and I have
1,920,282 entries in my database.

When I run "top" while my program is processing a directory (i.e.
computing the probabilities for all the files in it), I notice a lot of
I/O wait, and the percentage of CPU used is quite low.
One last point: if I stop the program and start it again, it is really
fast to give me results for the pages that were already scored before I
stopped.

Does that mean that only the tokens I get from the database (with
DB->get(...)) are stored in memory?
Does anyone have a solution?

Thank you in advance.
Regards,
Nicolas