bostic@sleepycat.com *nix forums beginner
Joined: 21 Jun 2005
Posts: 49
Posted: Fri Mar 11, 2005 2:25 pm
Post subject: Re: Preallocate backing file for the Berkeley DB cache
| Quote: | The file system is pretty full (over 70%), so long runs
of contiguous blocks are not available. It's much better than
the original version (with 8K increments).
|
OK, I'm convinced. I've submitted code changes for Berkeley
DB to ensure we don't fragment when pre-allocating underlying
shared region files. This change will be part of the upcoming
DB 4.4 release, tracked in our Support Request #12125.
Thanks for finding this one!
Regards,
--keith
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Keith Bostic bostic@sleepycat.com
Sleepycat Software Inc. keithbosticim (Yahoo)
118 Tower Rd. +1-781-259-3139
Lincoln, MA 01773 http://www.sleepycat.com

Florian Weimer *nix forums Guru
Joined: 19 Feb 2005
Posts: 418
Posted: Tue Mar 08, 2005 6:11 pm
Post subject: Re: Preallocate backing file for the Berkeley DB cache
* Florian Weimer:
| Quote: | By the way, with recent debugfs versions, you need a patch to print
the actual block numbers in most indirect blocks:
|
Or you can use the filefrag tool. *sigh*
It's much more straightforward to use. Do you need further
statistics? It seems that writing the file (with write(2)) could be
beneficial, but I would have to test this on a clean file system
(which I can't do right now).

Florian Weimer *nix forums Guru
Joined: 19 Feb 2005
Posts: 418
Posted: Tue Mar 08, 2005 4:04 pm
Post subject: Re: Preallocate backing file for the Berkeley DB cache
| Quote: | However, OS_VMPAGESIZE is set to 8192 unconditionally (see
dbinc/region.h), and DB_REGION_INIT touches pointers in
OS_VMPAGESIZE increments. Many systems have a page size of
4096, so it actually makes things worse because it
practically *guarantees* fragmentation of the underlying file.
Have you actually seen this happen anywhere?
If so, on what operating system/filesystem combination?
|
On Linux 2.6 (x86, 4K page size) with ext3fs (4K block size), the file
is created with holes:
Inode: 6963441 Type: regular Mode: 0640 Flags: 0x0 Generation: 2984990891
User: 1000 Group: 1000 Size: 262152192
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 258032
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x422dc37f -- Tue Mar 8 16:23:43 2005
atime: 0x422dc36e -- Tue Mar 8 16:23:26 2005
mtime: 0x422dc36e -- Tue Mar 8 16:23:26 2005
BLOCKS:
(0):13944376, (2):13944377, (4):13944378, (6):13944379, (8):13944380,
(10):13944381, (IND):13944382, (12):13944383, (14):13944384,
(16):13944392, (18):13944393, (20):13944394, (22):13944395,
(24):13944396, (26):13944397, (28):13944398, (30):13944399,
(32):13944400, (34):13944401, (36):13944402, (38):13944403,
(40):13944404, (42):13944405, (44):13944406, (46):13944407,
(48):13944408, (50):13944409, (52):13944624, (54):13944625,
(56):13944626, (58):13944627, (60):13944628, (62):13944629, (64):13944630,
(66):13944631, (68):13944632, (70):13944633, (72):13944634,
(74):13944635, (76):13944636, (78):13944637, (80):13944638,
(82):13944639, (84):13944640, (86):13944648, (88):13944649,
(90):13944650, (92):13944651, (94):1394 [...]
Notice that only even-numbered blocks are backed with file system
storage.
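The even-only pattern follows from simple arithmetic: touching one byte every 8192 bytes in a file with 4096-byte blocks lands in every second block. Here is a minimal sketch of that arithmetic. The function name `touched_blocks` is made up for illustration, and the code models only the stride, not Berkeley DB's actual region initialization.

```python
def touched_blocks(file_size, stride, block_size):
    """Return the sorted file-system block indexes that get hit when one
    byte is written every `stride` bytes, starting at offset 0."""
    return sorted({offset // block_size
                   for offset in range(0, file_size, stride)})


if __name__ == "__main__":
    # 8K stride on 4K blocks: only even-numbered blocks are touched.
    print(touched_blocks(64 * 1024, 8192, 4096))
    # 4K stride on 4K blocks: every block is touched.
    print(touched_blocks(64 * 1024, 4096, 4096))
```

With a stride equal to the block size, every block is touched, which is what the later 4KB experiment in this thread tests.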
After using the database for a while, part of the cache has not yet
been touched. Blocks 1, 3, 5, and so on are still not allocated. Yet
towards the end of the file, all blocks are allocated:
Inode: 6963441 Type: regular Mode: 0640 Flags: 0x0 Generation: 2984990891
User: 1000 Group: 1000 Size: 262152192
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 258032
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x422dc37f -- Tue Mar 8 16:23:43 2005
atime: 0x422dc36e -- Tue Mar 8 16:23:26 2005
mtime: 0x422dc36e -- Tue Mar 8 16:23:26 2005
BLOCKS:
(0):13944376, (2):13944377, (4):13944378, (6):13944379, (8):13944380,
(10):13944381, (IND):13944382, (12):13944383, (14):13944384,
(16):13944392, (18):13944393, (20):13944394, (22):13944395,
(24):13944396, (26):13944397, (28):13944398, (30):13944399,
(32):13944400, (34):13944401, (36):13944402, (38):13944403,
(40):13944404, (42):13944405, (44):13944406, (46):13944407,
(48):13944408, (50):13944409, (52):13944624, (54):13944625,
(56):13944626, (58):13944627, (60):13944628, (62):13944629,
(64):13944630, (66):13944631, (68):13944632, (70):13944633,
(72):13944634, (74):13944635, (76):13944636, (78):13944637,
(80):13944638, (82):13944639, (84):13944640, (86):13944648,
(88):13944649, (90):13944650, (92):13944651, (94):13944652,
(96):13944653, (98):13944654, (100):13944655, (102):13944656,
(104):13944657, (106):13944658, (108):13944659, (110):13944660,
(112):13944661, (114):13944662, (116):13944663, (118):13944664,
(120):13944665, (122):13944666, (124):13944667, (126):13944668,
(128):13944669,
[...]
(63932):13968103, (63933):13978950, (63934):13968104, (63935):13977853,
(63936):13968105, (63937):13978951, (63938):13968106,
(63939):13977854, (63940):13968107, (63941):13979009,
(63942):13968108, (63943):13977855, (63944):13968109,
(63945):13979010, (63946):13968110, (63947):13977856,
(63948):13968111, (63949):13979011, (63950):13968112,
(63951):13977864, (63952):13968113, (63953):13979012,
(63954):13968114, (63955):13977872, (63956):13968115,
(63957):13979013, (63958):13968116, (63959):13977880,
(63960):13968117, (63961):13979014, (63962):13968118,
(63963):13977888, (63964):13968119, (63965):13979015, (63966):13968120,
(63967):13977896, (63968):13968121, (63969):13979073,
(63970):13968122, (63971):13977904, (63972):13968123,
(63973):13979074, (63974):13968124, (63975):13977857,
(63976):13968125, (63977):13979075, (63978):13968126,
(63979):13977858, (63980):13968127, (63981):13979076,
(63982):13968128, (63983):13977859, (63984):13968136,
(63985):13979077, (63986):13968137, (63987):13977860,
(63988):13968138, (63989):13979078, (63990):13968139,
(63991):13977861, (63992):13968140, (63993):13979079,
(63994):13968141, (63995):13977862, (63996):13968142,
(63997):13979137, (63998):13968143, (63999):13977863,
(64000-64001):13943684-13943685
As you can see, the physical block numbers (after the colons) are in
pretty random order.
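One way to put a number on "pretty random order" is to count extents, that is, runs in which the logical and physical block numbers both advance by exactly one. The sketch below assumes the mapping has already been parsed into (logical, physical) pairs. The helper name `count_extents` is hypothetical, and the sample values are copied from the listings in this post.

```python
def count_extents(mapping):
    """Count contiguous runs in a list of (logical, physical) block pairs.

    A new extent starts whenever either number fails to advance by
    exactly one relative to the previous pair.
    """
    extents = 0
    prev = None
    for logical, physical in sorted(mapping):
        if prev is None or logical != prev[0] + 1 or physical != prev[1] + 1:
            extents += 1
        prev = (logical, physical)
    return extents


if __name__ == "__main__":
    # Three consecutive logical blocks from the end of the fragmented file.
    fragmented = [(63932, 13968103), (63933, 13978950), (63934, 13968104)]
    # Logical blocks 0-11 mapped to physical 13944385-13944396.
    contiguous = [(i, 13944385 + i) for i in range(12)]
    print(count_extents(fragmented), count_extents(contiguous))
```

A file laid out in one piece yields a single extent. The closer the count gets to the number of blocks, the more seeking a sequential write of the file needs.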
| Quote: | Does changing OS_VMPAGESIZE to 4KB make a difference on that
system?
|
I hope I correctly made this change.
Inode: 6963441 Type: regular Mode: 0640 Flags: 0x0 Generation: 2985030886
User: 1000 Group: 1000 Size: 262148096
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 512520
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x422dce4f -- Tue Mar 8 17:09:51 2005
atime: 0x422dd774 -- Tue Mar 8 17:48:52 2005
mtime: 0x422dce4f -- Tue Mar 8 17:09:51 2005
BLOCKS:
(0-11):13944385-13944396, (12-23):13944398-13944409,
(24-69):13944624-13944669, (70):13945011, (71-75):13945015-13945019,
(76-217):13945175-13945316, (218-293):13945352-13945427,
(294-296):13945441-13945443, (297-757):13945467-13945927,
(758-876):13945930-13946048, (877-1035):13946051-13946209,
(IND):13944397, (1036-1056):13946211-13946231,
(1057-1068):13946237-13946248, (1069-1090):13946254-13946275,
(1091-1115):13946280-13946304, (1116-1172):13946312-13946368,
(1173-1229):13946376-13946432, (1230-1271):13946440-13946481,
(1272-1280):13946488-13946496, (1281-1337):13946504-13946560,
(1338-1394):13946568-13946624, (1395-1451):13946632-13946688,
(1452-1508):13946696-13946752, (1509-1553):13946760-13946804,
(1554):13947191, (1555-1563):13947256-13947264,
(1564-1620):13947272-13947328, (1621-1629):13947336-13947344,
(1630-1642):13947349-13947361, (1643-1663):13947372-13947392,
(1664-1682):13947400-13947418, (1683):13947426, (1684):13947434,
(1685):13947438, (1686-1687):13947440-13947441,
[...]
(60428-60431):3190676-3190679, (60432-60438):3190681-3190687,
(60439-60441):3190693-3190695, (60442-60448):3190697-3190703,
(60449-60450):3190710-3190711, (60451-60457):3190713-3190719,
(60458-60463):3191178-3191183, (60464-60470):3191185-3191191,
(60471-60472):3191286-3191287, (60473-60479):3191289-3191295,
(60480-60486):3191617-3191623, (60487-60493):3191625-3191631,
(60494-60500):3191633-3191639, (60501-60505):3191643-3191647,
(60506-60512):3191649-3191655, (60513-60519):3191657-3191663,
(60520):3199666, (60521):3199669, (60522-60730):3199792-3200000,
(60731-60736):3200002-3200007, (60737-60743):3200065-3200071,
(60744-60750):3200129-3200135, (60751):3200192,
(60752-60764):3200194-3200206, (60765-60769):3200211-3200215,
(60770-60776):3200257-3200263, (60777-60783):3200321-3200327,
(60784-60790):3200385-3200391, (60791-60797):3200449-3200455,
(60798-60804):3200513-3200519, (60805-60811):3200577-3200583,
(60812-60817):3200594-3200599, (60818-60824):3200641-3200647,
(60825-61451):3201240-3201866, (IND):3190675,
(61452-61639):3201868-3202055, (61640-61647):3202624-3202631,
(61648-61654):3202689-3202695, (61655-62475):3203184-3204004,
(IND):3201867, (62476-63141):3204006-3204671,
(63142-63499):3204680-3205037, (IND):3204005,
(63500-63581):3205038-3205119, (63582-63999):3205152-3205569,
(64000):13944379, (IND):13944378, (DIND):13944377
TOTAL: 64065
The file system is pretty full (over 70%), so long runs of
contiguous blocks are not available. It's much better than the
original version (with 8K increments). Actually, a non-sparse copy of
the same file looks much the same.
By the way, with recent debugfs versions, you need a patch to print
the actual block numbers in most indirect blocks:
--- e2fsprogs-1.36.orig/debugfs/debugfs.c 2004-12-06 23:45:50.000000000 +0100
+++ e2fsprogs-1.36/debugfs/debugfs.c 2005-03-08 18:00:45.000000000 +0100
@@ -411,7 +411,7 @@
 	lb.first_block = 0;
 	lb.f = f;
 	lb.first = 1;
-	ext2fs_block_iterate2(current_fs, inode, 0, NULL,
+	ext2fs_block_iterate2(current_fs, inode, BLOCK_FLAG_DEPTH_TRAVERSE, NULL,
 			      list_blocks_proc, (void *)&lb);
 	finish_range(&lb);
 	if (lb.total)

bostic@sleepycat.com *nix forums beginner
Joined: 21 Jun 2005
Posts: 49
Posted: Tue Mar 08, 2005 2:03 pm
Post subject: Re: Preallocate backing file for the Berkeley DB cache
| Quote: | However, OS_VMPAGESIZE is set to 8192 unconditionally (see
dbinc/region.h), and DB_REGION_INIT touches pointers in
OS_VMPAGESIZE increments. Many systems have a page size of
4096, so it actually makes things worse because it
practically *guarantees* fragmentation of the underlying file.
|
Have you actually seen this happen anywhere? If so, on what
operating system/filesystem combination?
Does changing OS_VMPAGESIZE to 4KB make a difference on that
system?
Regards,
--keith
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Keith Bostic bostic@sleepycat.com
Sleepycat Software Inc. keithbosticim (Yahoo)
118 Tower Rd. +1-781-259-3139
Lincoln, MA 01773 http://www.sleepycat.com

Florian Weimer *nix forums Guru
Joined: 19 Feb 2005
Posts: 418
Posted: Sun Mar 06, 2005 5:20 pm
Post subject: Re: Preallocate backing file for the Berkeley DB cache
* Philip Guenther:
| Quote: | If the process that creates the environment sets the DB_REGION_INIT flag
before the DB_ENV->open() call, then the open will preallocate all the
region files, including the memory pool.
|
Ah, I missed that one. Thanks.
However, OS_VMPAGESIZE is set to 8192 unconditionally (see
dbinc/region.h), and DB_REGION_INIT touches pointers in OS_VMPAGESIZE
increments. Many systems have a page size of 4096, so it actually
makes things worse because it practically *guarantees* fragmentation
of the underlying file. 8-(

Philip Guenther *nix forums beginner
Joined: 06 Mar 2005
Posts: 6
Posted: Sun Mar 06, 2005 4:50 pm
Post subject: Re: Preallocate backing file for the Berkeley DB cache
Florian Weimer <fw@deneb.enyo.de> writes:
| Quote: | Currently, the file backing the Berkeley DB cache is not preallocated
when it's created. Only a sparse file is created.
....
I think this could be avoided if the backing file is preallocated and
not just created as a sparse file.
|
If the process that creates the environment sets the DB_REGION_INIT flag
before the DB_ENV->open() call, then the open will preallocate all the
region files, including the memory pool.
(Don't forget that you can make that change via a DB_CONFIG file...)
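As a rough illustration of the DB_CONFIG route: the Berkeley DB documentation describes DB_CONFIG as a text file in the environment home directory, read when the environment is opened, with flag entries of the form `set_flags FLAG_NAME`. The sketch below adds such a line for DB_REGION_INIT. The helper name `enable_region_init` and the directory passed to it are assumptions for illustration, and the line syntax should be checked against the documentation for the release in use.

```python
import os


def enable_region_init(env_home):
    """Make sure env_home/DB_CONFIG contains 'set_flags DB_REGION_INIT'.

    Appends the line only if it is not already present and returns the
    path of the config file.
    """
    path = os.path.join(env_home, "DB_CONFIG")
    line = "set_flags DB_REGION_INIT"
    content = ""
    if os.path.exists(path):
        with open(path) as f:
            content = f.read()
    if line in (entry.strip() for entry in content.splitlines()):
        return path
    with open(path, "a") as f:
        # Start on a fresh line if the existing file lacks a final newline.
        if content and not content.endswith("\n"):
            f.write("\n")
        f.write(line + "\n")
    return path
```

Because the file is only consulted at open time, the setting matters for the process that creates the environment, which matches the condition stated above.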
Philip Guenther

Florian Weimer *nix forums Guru
Joined: 19 Feb 2005
Posts: 418
Posted: Sat Mar 05, 2005 9:01 pm
Post subject: Preallocate backing file for the Berkeley DB cache
Currently, the file backing the Berkeley DB cache is not preallocated
when it's created. Only a sparse file is created. This means that
most file systems create a heavily fragmented backing file over time,
as more and more data is actually written to disk. When the
application that uses Berkeley DB terminates, recent Linux 2.6
versions immediately start to write out the backing file. Because of its
heavy fragmentation, this write operation is rather slow.
I think this could be avoided if the backing file is preallocated and
not just created as a sparse file. (I still have to run a simulation
to check if this is really the case, though.)
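To make the sparse-versus-preallocated distinction concrete, here is a small sketch that creates a file both ways and reports how much storage is actually allocated. The function names are made up for illustration. It relies on `st_blocks`, which is available on POSIX systems and counts 512-byte units. Writing zeros sequentially only gives the file system the chance to allocate contiguously; it does not guarantee it.

```python
import os


def create_sparse(path, size):
    """Create a file of `size` bytes without writing any data blocks."""
    with open(path, "wb") as f:
        f.truncate(size)


def create_preallocated(path, size, chunk=1 << 20):
    """Create a file of `size` bytes by writing zeros front to back."""
    zeros = bytes(chunk)
    with open(path, "wb") as f:
        remaining = size
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(zeros[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # push the data out so blocks get allocated


def allocated_bytes(path):
    """Storage allocated to the file, from st_blocks (512-byte units)."""
    return os.stat(path).st_blocks * 512
```

On most file systems `allocated_bytes` is close to zero for the sparse file and close to the file size for the preallocated one, although file systems that compress or deduplicate data may report something different.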