Gordon Burditt *nix forums Guru
Joined: 02 Mar 2005
Posts: 773
|
Posted: Tue Feb 28, 2006 6:41 am Post subject:
Re: Whats the practical maximum file size using indexed allocation (I nodes)
|
|
|
| Quote: | Yes but it must be added, the question itself seems to reveal a lack of
understanding. True, Unix (what's a *NIX anyway?) directory entries have
this limitation but that's like saying potatoes will fall if you drop
them from a tower - yes, and so will anything else. All directory
entries, Unix or not, are inherently limited to references within their
own filesystem, just as pointers in a program cannot point into another
program's address space and a table in a relational database cannot
contain data from another database.
|
Symlinks are a counterexample to this. (Yes, in some implementations,
a symlink is a directory entry with no associated inode.)
Some relational databases (Oracle included, as I understand it) permit
joining two tables on different servers in two different countries.
MySQL has the Federated storage engine. "A Federated table acts as
a pointer to an actual table object that exists on the same or
another server."
I see nothing inherently impossible about using a URL as a
filename to refer to remote objects. PHP accepts these as
an argument to fopen(). I think I've heard of someone doing
this sort of thing at the filesystem level, especially with
FTP. And who says something like:
ln -s http://www.microsoft.com/ xyz
is impossible?
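For what it's worth, on a POSIX system the symlink target is stored as an uninterpreted string, so the `ln -s` above should succeed; the link simply dangles unless something chooses to interpret the URL. A minimal Python sketch of that, assuming a POSIX filesystem (the helper name `make_url_symlink` is made up for illustration):

```python
import os
import tempfile


def make_url_symlink(directory: str, name: str, target: str) -> str:
    """Create a symlink whose target is an arbitrary string; return what was stored."""
    link_path = os.path.join(directory, name)
    os.symlink(target, link_path)   # the target is not validated or resolved here
    return os.readlink(link_path)   # read the stored string back without following it


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        stored = make_url_symlink(d, "xyz", "http://www.microsoft.com/")
        print("stored target:", stored)
        # Following the link fails: the kernel treats the URL as a relative path.
        print("resolves to a file:", os.path.exists(os.path.join(d, "xyz")))
```

Nothing here fetches the remote object; that part would need filesystem-level support for URLs.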
I see nothing *INHERENTLY* impossible about defining a Universal Byte
Pointer (UBP) which can point to any byte. Anywhere. Subfields of
such a pointer might include:
IPv8 machine address (256-kbits)
Disk number (128 bits)
Partition number (128 bits)
Cylinder number (128 bits)
Head number (128 bits)
Sector number (128 bits)
Offset (128 bits)
Alternate Universe number (4-kbits)
Galaxy number (128 bits)
Planetary system code (128 bits)
| Quote: | These are not quirks of
implementation; they are requirements of reality.
|
The problem I see is that if you wish to refer to another filesystem,
how do you reasonably IDENTIFY that file system. Is it always the
filesystem in partition e of the disk mounted on THIS drive? Is
it always the filesystem with this volume label? Even if there are
two such filesystems mounted now? How about if it's not mounted
at all now? How do you keep the references sane when disks are
mounted in nonstandard places, or adding a controller causes
re-numbering of the existing ones?
Gordon L. Burditt |
|
Nick Roberts *nix forums beginner
Joined: 12 Feb 2006
Posts: 5
|
Posted: Tue Feb 28, 2006 6:58 pm Post subject:
Multi-volume integrated storage
|
|
|
Gordon Burditt wrote:
| Quote: | The problem I see is that if you wish to refer to another filesystem,
how do you reasonably IDENTIFY that file system. Is it always the
filesystem in partition e of the disk mounted on THIS drive? Is
it always the filesystem with this volume label? Even if there are
two such filesystems mounted now? How about if it's not mounted
at all now? How do you keep the references sane when disks are
mounted in nonstandard places, or adding a controller causes
re-numbering of the existing ones?
|
These are all questions I would like answered. I would like my file
system (Aquila) to be able to store files transparently in multiple
physical volumes (in permanent storage devices all connected to one
workstation).
Aquila has two quirks, in that 'files' are called 'stores' and each
store is identified by a 32-bit or 64-bit number, a 'store ID', rather
than a name. A store ID is very similar to an inode number. However, to
keep the confusion factor down, I'll talk about files and file IDs here.
A workstation may well have more than one physical storage device
(generally hard disks). It would be nice to be able to have one
'logical' Aquila volume span multiple physical 'sub-volumes' (disks or
maybe partitions). There is one special scenario in which this would be
very advantageous: if we want to store a file that is bigger than any
one disk. (There are quirks of how AdaOS works that actually make this a
likely scenario.)
But there are certainly quite a number of problems with the idea. In
addition to the problems Gordon points out:
How do I allocate blocks to files? Do I try to keep them all in one
physical sub-volume, or, conversely, do I try to spread them around as
much as possible?
In a similar vein, which sub-volume do I start allocating blocks in at
first? Do I try to fill up one before using the others? If so, which
one? (The biggest? The smallest? The fastest?)
What about block numbering: do I start at 0 with the first sub-volume,
and then continue numbering at the next sub-volume, and so on? Or do I
have to encode a sub-volume number along with every block number? Must
the sub-volumes be of fixed size? (I presume so.)
What do I do (this seems to be a point that Gordon was making) if one of
the disks goes up to the big supercomputer in the sky? Do I need to
store redundant copies of file blocks? In which case, which ones? (The
index blocks, the data blocks, everything?) I suppose all important
meta-data (superblocks and suchlike) must be duplicated across all
sub-volumes.
I think that one could allocate a unique ID to each sub-volume, when
formatting it, and the uber-meta-data (superblock?) could contain a list
of the identifiers of all the sub-volumes that make up the complete
logical volume. That helps. But the whole scheme does seem a little bit
vulnerable to disk failure: one disk goes bang and you're buggered.
There is also the whole business of striping, to which some of these
questions also apply.
How do other file systems solve (or approach) these problems?
--
Nick Roberts
PS: I hope the change of subject line is okay. |
|
Gordon Burditt *nix forums Guru
Joined: 02 Mar 2005
Posts: 773
|
Posted: Tue Feb 28, 2006 8:36 pm Post subject:
Re: Multi-volume integrated storage
|
|
|
| Quote: | The problem I see is that if you wish to refer to another filesystem,
how do you reasonably IDENTIFY that file system. Is it always the
filesystem in partition e of the disk mounted on THIS drive? Is
it always the filesystem with this volume label? Even if there are
two such filesystems mounted now? How about if it's not mounted
at all now? How do you keep the references sane when disks are
mounted in nonstandard places, or adding a controller causes
re-numbering of the existing ones?
|
Part of the point here is that I would like to be able to do
a number of things with filesystems and not have the references horribly
screwed up. For example I want to be able to:
1. Mount two filesystems (we'll call them MASTER and BACKUP) and
copy from one to the other. That means I have to be able to tell
the difference between them. The copies I make might be with
tar or cp -r, so I can't count on inodes being the same between the
two. But I have to be able to mount them at the same time.
2. Have references from some other filesystem into MASTER work.
3. MASTER gets screwed up, so I unmount it and replace it with BACKUP.
The references in #2 need to work with corresponding files on
BACKUP without changing them. That means filesystem identification
by media serial number won't work here since that would
be different for MASTER and BACKUP.
| Quote: | These are all questions I would like answered. I would like my file
system (Aquila) to be able to store files transparently in multiple
physical volumes (in permanent storage devices all connected to one
workstation).
Aquila has two quirks, in that 'files' are called 'stores' and each
store is identified by a 32-bit or 64-bit number, a 'store ID', rather
than a name. A store ID is very similar to an inode number. However, to
keep the confusion factor down, I'll talk about files and file IDs here.
A workstation may well have more than one physical storage device
(generally hard disks). It would be nice to be able to have one
'logical' Aquila volume span multiple physical 'sub-volumes' (disks or
maybe partitions). There is one special scenario in which this would be
very advantageous: if we want to store a file that is bigger than any
one disk. (There are quirks of how AdaOS works that actually make this a
likely scenario.)
|
Wacky definition: partition: a chunk of storage you put a filesystem on.
This is often done with the intermediate stage of a virtual partition.
A virtual partition looks like a big chunk of disk with
sequentially-numbered blocks, regardless of whether the actual
storage is spread over multiple drives. A virtual partition may
be RAIDed: any given block is actually stored in multiple places
for redundancy. It might be striped for higher access speed.
Or both.
There are a number of advantages for the filesystem to NOT know
about the underlying structure of the partition.
| Quote: | But there are certainly quite a number of problems with the idea. In
addition to the problems Gordon points out:
How do I allocate blocks to files? Do I try to keep them all in one
physical sub-volume, or, conversely, do I try to spread them around as
much as possible?
|
For speed, you might want to interleave the blocks: even blocks
are on the first device, odd blocks are on the second. This
reduces seek times and spreads the load over both drives fairly
evenly. Or you might just number the blocks from 0 on the first
drive, continuing from the end of the first drive to the beginning
of the second.
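The two numbering schemes just described can be sketched as a mapping from a virtual block number to a (device, local block) pair. This is only an illustration in Python, assuming block-granular interleaving and hypothetical function names; real volume managers usually interleave in larger chunks.

```python
from typing import List, Tuple


def map_interleaved(block: int, num_devices: int) -> Tuple[int, int]:
    """Round-robin layout: virtual block b lives on device b % n, local block b // n."""
    if block < 0 or num_devices <= 0:
        raise ValueError("block must be >= 0 and num_devices must be > 0")
    return block % num_devices, block // num_devices


def map_concatenated(block: int, device_sizes: List[int]) -> Tuple[int, int]:
    """End-to-end layout: numbering continues from one device onto the next."""
    if block < 0:
        raise ValueError("block must be >= 0")
    start = 0
    for device, size in enumerate(device_sizes):
        if block < start + size:
            return device, block - start
        start += size
    raise ValueError("block is beyond the end of the virtual partition")


if __name__ == "__main__":
    print(map_interleaved(5, 2))              # (1, 2): odd blocks go to the second device
    print(map_concatenated(100, [100, 50]))   # (1, 0): first block of the second device
```

Note that the concatenated layout copes with devices of different sizes, while simple interleaving assumes they are equal.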
Your filesystem has a strategy already for working on a real disk,
I presume. Something like putting blocks of the same file together
and spreading out different files? The same can apply on a virtual
partition if the block numbering is done so that going from accessing
a block close to block N (which you accessed last) has a better chance
of being faster than access from something far away from block N.
| Quote: | In a similar vein, which sub-volume do I start allocating blocks in at
first? Do I try to fill up one before using the others? If so, which
one? (The biggest? The smallest? The fastest?)
What about block numbering: do I start at 0 with the first sub-volume,
and then continue numbering at the next sub-volume, and so on? Or do I
have to encode a sub-volume number along with every block number? Must
the sub-volumes be of fixed size? (I presume so.)
|
To allow image copying of filesystems, I'd prefer to see block numbering
from 0 to the size of the filesystem -1 with no visibility of the
underlying structure (which might change). That makes it possible
to copy from a single-disk filesystem to a 3-way RAID setup with
'dd' (once you configure the virtual partition).
| Quote: | What do I do (this seems to be a point that Gordon was making) if one of
the disks goes up to the big supercomputer in the sky? Do I need to
store redundant copies of file blocks? In which case, which ones? (The
|
Do you need redundancy? On /tmp, probably not. On the filesystem
with your database on it, probably. It's an economic tradeoff.
How much do redundant disks cost? How much do employees making
backups (and the media they use) cost? What is the cost impact of
downtime? If you need redundancy, you probably need to store
redundant copies of *all* of the blocks.
This is often done by making a virtual partition redundant and
the filesystem code need know nothing about the redundancy.
It simply writes a block. It need not know that this translates
to several writes on several drives.
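A rough sketch of that layering, in Python with in-memory stand-ins for the drives (the class names `MemoryDisk` and `MirroredPartition` are invented for the example and do not correspond to any real driver):

```python
from typing import List


class MemoryDisk:
    """Stand-in for a physical drive: a fixed array of equally sized blocks."""

    def __init__(self, num_blocks: int, block_size: int = 512) -> None:
        self.block_size = block_size
        self.blocks: List[bytes] = [bytes(block_size)] * num_blocks

    def write_block(self, n: int, data: bytes) -> None:
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block long")
        self.blocks[n] = bytes(data)

    def read_block(self, n: int) -> bytes:
        return self.blocks[n]


class MirroredPartition:
    """Virtual partition exposing the same block interface as a single disk."""

    def __init__(self, disks: List[MemoryDisk]) -> None:
        if not disks:
            raise ValueError("need at least one disk")
        self.disks = list(disks)
        self.block_size = disks[0].block_size

    def write_block(self, n: int, data: bytes) -> None:
        # One logical write fans out to every mirror.
        for disk in self.disks:
            disk.write_block(n, data)

    def read_block(self, n: int) -> bytes:
        # Any mirror will do; this sketch always reads the first one.
        return self.disks[0].read_block(n)


if __name__ == "__main__":
    a, b = MemoryDisk(8), MemoryDisk(8)
    vol = MirroredPartition([a, b])
    vol.write_block(3, b"x" * 512)
    print(a.read_block(3) == b.read_block(3))
```

A filesystem written against `write_block`/`read_block` would run unchanged on a `MemoryDisk` or a `MirroredPartition`, which is the point being made above.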
| Quote: | index blocks, the data blocks, everything?) I suppose all important
meta-data (superblocks and suchlike) must be duplicated across all
sub-volumes.
I think that one could allocate a unique ID to each sub-volume, when
formatting it, and the uber-meta-data (superblock?) could contain a list
of the identifiers of all the sub-volumes that make up the complete
logical volume. That helps. But the whole scheme does seem a little bit
vulnerable to disk failure: one disk goes bang and you're buggered.
|
You need to safely store the configuration of your virtual partition
since if it is reassembled incorrectly (disks in wrong order,
redundant vs. not redundant, etc.) you get a complete mess.
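One way to guard against wrong-order reassembly, combining this with the per-sub-volume ID idea quoted above, is to record the ordered ID list in the configuration and rebuild the device order from the IDs actually found on the disks. A hedged Python sketch, where the ID strings, device paths and the `assemble` helper are all hypothetical:

```python
from typing import Dict, List


def assemble(expected_ids: List[str], found: Dict[str, str]) -> List[str]:
    """Return device paths in the order recorded in the saved configuration.

    expected_ids: sub-volume IDs, in volume order, as stored in the configuration.
    found: maps each ID read from an attached device to that device's current path.
    """
    missing = [sub_id for sub_id in expected_ids if sub_id not in found]
    if missing:
        # Refuse to assemble a partial volume rather than produce a mess.
        raise LookupError(f"missing sub-volumes: {missing}")
    return [found[sub_id] for sub_id in expected_ids]


if __name__ == "__main__":
    config = ["vol-a", "vol-b"]
    # After adding a controller the device names swapped, but the IDs did not.
    attached = {"vol-b": "/dev/disk0", "vol-a": "/dev/disk1"}
    print(assemble(config, attached))
```

Because the order comes from the stored IDs and not from device names, controller renumbering does not change the result.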
| Quote: | There is also the whole business of striping, to which some of these
questions also apply.
How do other file systems solve (or approach) these problems?
|
I prefer to not have the *file system* solve this problem.
You have a virtual partition (chunk of disk space) with properties
you like. Then put a filesystem on it.
Gordon L. Burditt |
|
Jose R. Valverde *nix forums beginner
Joined: 01 Mar 2006
Posts: 1
|
Posted: Wed Mar 01, 2006 8:34 pm Post subject:
Re: Multi-volume integrated storage
|
|
|
In article <1209d253f9qm9b4@corp.supernews.com>,
gordonb.2932i@burditt.org (Gordon Burditt) writes:
| Quote: | Part of the point here is that I would like to be able to do
a number of things with filesystems and not have the references horribly
screwed up. For example I want to be able to:
|
Most of this is easy by just using symbolic links.
| Quote: | 1. Mount two filesystems (we'll call them MASTER and BACKUP) and
copy from one to the other. That means I have to be able to tell
the difference between them. The copies I make might be with
tar or cp -r, so I can't count on inodes being the same between the
two. But I have to be able to mount them at the same time.
|
Easiest way is to make a mirroring RAID system and forget about all
other problems.
Actually many of the issues discussed in this thread are easily
solvable by RAID: want a file larger than any one disk? Concatenate
disks into a single logical volume. Want striping? Other things?
The main problem and the one that seems harder is that of consolidating
resources among separate computers. RAID over remote block devices may
be a solution, but then comes the issue of resilience.
Ideally one would like to have a distributed RAID system with striping
and more than two-way replication/mirroring, or several distributed
spare devices for checksums, so that if any one computer or device
fails there are recovery mechanisms.
| Quote: | 2. Have references from some other filesystem into MASTER work.
|
symlinks
| Quote: | 3. MASTER gets screwed up, so I unmount it and replace it with BACKUP.
The references in #2 need to work with corresponding files on
BACKUP without changing them. That means filesystem identification
by media serial number won't work here since that would
be different for MASTER and BACKUP.
|
symlinks again
| Quote: | These are all questions I would like answered. I would like my file
system (Aquila) to be able to store files transparently in multiple
physical volumes (in permanent storage devices all connected to one
workstation).
|
RAID.
| Quote: | Aquila has two quirks, in that 'files' are called 'stores' and each
store is identified by a 32-bit or 64-bit number, a 'store ID', rather
than a name. A store ID is very similar to an inode number. However, to
keep the confusion factor down, I'll talk about files and file IDs here.
A workstation may well have more than one physical storage device
(generally hard disks). It would be nice to be able to have one
'logical' Aquila volume span multiple physical 'sub-volumes' (disks or
maybe partitions). There is one special scenario in which this would be
very advantageous: if we want to store a file that is bigger than any
one disk. (There are quirks of how AdaOS works that actually make this a
likely scenario.)
|
LVM and RAID
| Quote: | Wacky definition: partition: a chunk of storage you put a filesystem on.
This is often done with the intermediate stage of a virtual partition.
A virtual partition looks like a big chunk of disk with
sequentially-numbered blocks, regardless of whether the actual
storage is spread over multiple drives. A virtual partition may
be RAIDed: any given block is actually stored in multiple places
for redundancy. It might be striped for higher access speed.
Or both.
There are a number of advantages for the filesystem to NOT know
about the underlying structure of the partition.
But there are certainly quite a number of problems with the idea. In
addition to the problems Gordon points out:
How do I allocate blocks to files? Do I try to keep them all in one
physical sub-volume, or, conversely, do I try to spread them around as
much as possible?
|
RAID
| Quote: | For speed, you might want to interleave the blocks: even blocks
are on the first device, odd blocks are on the second. This
reduces seek times and spreads the load over both drives fairly
evenly. Or you might just number the blocks from 0 on the first
drive, continuing from the end of the first drive to the beginning
of the second.
Your filesystem has a strategy already for working on a real disk,
I presume. Something like putting blocks of the same file together
and spreading out different files? The same can apply on a virtual
partition if the block numbering is done so that going from accessing
a block close to block N (which you accessed last) has a better chance
of being faster than access from something far away from block N.
In a similar vein, which sub-volume do I start allocating blocks in at
first? Do I try to fill up one before using the others? If so, which
one? (The biggest? The smallest? The fastest?)
What about block numbering: do I start at 0 with the first sub-volume,
and then continue numbering at the next sub-volume, and so on? Or do I
have to encode a sub-volume number along with every block number? Must
the sub-volumes be of fixed size? (I presume so.)
To allow image copying of filesystems, I'd prefer to see block numbering
from 0 to the size of the filesystem -1 with no visibility of the
underlying structure (which might change). That makes it possible
to copy from a single-disk filesystem to a 3-way RAID setup with
'dd' (once you configure the virtual partition).
What do I do (this seems to be a point that Gordon was making) if one of
the disks goes up to the big supercomputer in the sky? Do I need to
store redundant copies of file blocks? In which case, which ones? (The
Do you need redundancy? On /tmp, probably not. On the filesystem
with your database on it, probably. It's an economic tradeoff.
How much do redundant disks cost? How much do employees making
backups (and the media they use) cost? What is the cost impact of
downtime? If you need redundancy, you probably need to store
redundant copies of *all* of the blocks.
This is often done by making a virtual partition redundant and
the filesystem code need know nothing about the redundancy.
It simply writes a block. It need not know that this translates
to several writes on several drives.
index blocks, the data blocks, everything?) I suppose all important
meta-data (superblocks and suchlike) must be duplicated across all
sub-volumes.
I think that one could allocate a unique ID to each sub-volume, when
formatting it, and the uber-meta-data (superblock?) could contain a list
of the identifiers of all the sub-volumes that make up the complete
logical volume. That helps. But the whole scheme does seem a little bit
vulnerable to disk failure: one disk goes bang and you're buggered.
You need to safely store the configuration of your virtual partition
since if it is reassembled incorrectly (disks in wrong order,
redundant vs. not redundant, etc.) you get a complete mess.
There is also the whole business of striping, to which some of these
questions also apply.
How do other file systems solve (or approach) these problems?
I prefer to not have the *file system* solve this problem.
You have a virtual partition (chunk of disk space) with properties
you like. Then put a filesystem on it.
Gordon L. Burditt
|
--
These opinions are mine and only mine. Hey man, I saw them first!
José R. Valverde
De nada sirve la Inteligencia Artificial cuando falta la Natural |
|
Maxim S. Shatskih *nix forums addict
Joined: 02 Apr 2005
Posts: 55
|
Posted: Wed Mar 01, 2006 8:47 pm Post subject:
Re: Whats the practical maximum file size using indexed allocation (I nodes)
|
|
|
| Quote: | And BTW, in the Unix world at least, the terms "volume", "partition",
"drive", and "filesystem" refer to potentially divergent things.
|
Aren't "volume" and "filesystem" synonyms? Yes, "drive" and "partition" are
different things, but "volume" and "filesystem"?
--
Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
maxim@storagecraft.com
http://www.storagecraft.com |
|
Gordon Burditt *nix forums Guru
Joined: 02 Mar 2005
Posts: 773
|
Posted: Wed Mar 01, 2006 10:45 pm Post subject:
Re: Multi-volume integrated storage
|
|
|
| Quote: | Part of the point here is that I would like to be able to do
a number of things with filesystems and not have the references horribly
screwed up. For example I want to be able to:
Most of this is easy by just using symbolic links.
|
And part of the point of this discussion is that out-of-filesystem
hard links probably wouldn't work anything like symlinks, especially
if they used unique volume labels or hardware serial numbers to
distinguish between disks.
| Quote: | 1. Mount two filesystems (we'll call them MASTER and BACKUP) and
copy from one to the other. That means I have to be able to tell
the difference between them. The copies I make might be with
tar or cp -r, so I can't count on inodes being the same between the
two. But I have to be able to mount them at the same time.
Easiest way is to make a mirroring RAID system and forget about all
other problems.
|
Won't work. You still need backups. One slip of the finger
can mess up a database on all the copies of the RAID simultaneously
before you can even get your finger off the key you accidentally hit.
(This is also a problem with replicated databases.)
RAID is great for hardware failures, though.
| Quote: | Actually many of the issues discussed in this thread are easily
solvable by RAID: want a file larger than any one disk? Concatenate
disks into a single logical volume. Want striping? Other things?
|
Yes, I agree: put the RAID and striping and whatever in creating
a virtual partition, and then create a filesystem (which knows
nothing of the underlying setup) on top of that partition.
| Quote: | The main problem and the one that seems harder is that of consolidating
resources among separate computers. RAID over remote block devices may
be a solution, but then comes the issue of resilience.
Ideally one would like to have a distributed RAID system with striping
and bigger than two replication/mirroring or various distributed
spare devices for checksums, so that if any one computer/device
fails there may be recovery mechanisms.
2. Have references from some other filesystem into MASTER work.
symlinks
3. MASTER gets screwed up, so I unmount it and replace it with BACKUP.
The references in #2 need to work with corresponding files on
BACKUP without changing them. That means filesystem identification
by media serial number won't work here since that would
be different for MASTER and BACKUP.
symlinks again
How do other file systems solve (or approach) these problems?
I prefer to not have the *file system* solve this problem.
You have a virtual partition (chunk of disk space) with properties
you like. Then put a filesystem on it.
Gordon L. Burditt |
|
|
Brian Inglis *nix forums beginner
Joined: 16 Feb 2005
Posts: 22
|
Posted: Thu Mar 02, 2006 4:28 am Post subject:
Re: Whats the practical maximum file size using indexed allocation (I nodes)
|
|
|
On Wed, 1 Mar 2006 23:47:02 +0300 in comp.unix.internals, "Maxim S.
Shatskih" <maxim@storagecraft.com> wrote:
| Quote: | And BTW, in the Unix world at least, the terms "volume", "partition",
"drive", and "filesystem" refer to potentially divergent things.
Aren't "volume" and "filesystem" synonyms? Yes, "drive" and "partition" are
different things, but "volume" and "filesystem"?
|
Used to be a volume was the media in a drive; a partition was a
subdivision of a volume, and a filesystem could be written in a
partition.
Now, a logical volume can span multiple partitions in various ways,
and a filesystem can be written in a logical volume.
--
Thanks. Take care, Brian Inglis Calgary, Alberta, Canada
Brian.Inglis@CSi.com (Brian[dot]Inglis{at}SystematicSW[dot]ab[dot]ca)
fake address use address above to reply |
|
Reginald Beardsley *nix forums beginner
Joined: 13 Dec 2005
Posts: 22
|
Posted: Sat Mar 11, 2006 2:08 pm Post subject:
Re: Multi-volume integrated storage
|
|
|
Having recently looked at a large (i.e. petabyte) filesystem structure
at a large oil company I will state emphatically that symlinks are NOT
the answer. The briefest perusal of the filesystem encountered many
dangling symlinks. A coworker reported encountering giant cycles when
trying to find data in the mess.
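For anyone wanting to audit a tree for this kind of rot, a small Python sketch follows; it assumes a POSIX filesystem and treats any symlink that cannot be resolved (missing target or a link loop) as dangling. The function name is made up for the example.

```python
import os
import tempfile
from typing import List


def find_dangling_symlinks(root: str) -> List[str]:
    """Return the sorted paths of symlinks under root that do not resolve."""
    dangling = []
    # os.walk does not follow symlinked directories by default, so cycles
    # through directory links cannot trap the traversal itself.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it and
            # returns False when the target is missing or the link loops.
            if os.path.islink(path) and not os.path.exists(path):
                dangling.append(path)
    return sorted(dangling)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        open(os.path.join(d, "real.txt"), "w").close()
        os.symlink("real.txt", os.path.join(d, "good"))
        os.symlink("nowhere", os.path.join(d, "bad"))
        print(find_dangling_symlinks(d))
```

On a petabyte-scale tree a full walk like this is slow, so it is more of a diagnostic than a cure.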
Much, but likely not all, of what is being sought can be implemented
using the automounter w/ environment variables in the tables. This
makes it possible to refer to the same place w/ different paths and
different places w/ the same path in a transparent fashion.
Done on top of volume management, a quick yppush will divert the hosts
from a failing filesystem to a backup filesystem. There is, of course,
a bit of detail to getting things unmounted and remounted to be solved ;-)
I did this 10-15 years ago at another oil company to allow a common path
which transparently referenced the correct binaries. It worked quite
well. I even got a nice note from an admin several years later that
what he was expecting to be a long night went very quickly because
/tool/bin/ was valid on all platforms and hosts and only required a
single binary instance per platform.
The map entries look like:
/tool ${HOST}:/tool/${ARCH}
w/ appropriate decoration for indirect or direct maps. The ${HOST} part
will allow a machine to use some specific copy of a filesystem which has
been replicated w/o affecting any other systems which can be very useful
for testing a newly constructed filesystem. Naturally, you need to pass
the automounter the flag telling it to use environment variables and you
need to set the environment variables at boot time.
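To make the substitution concrete, here is a small Python sketch of how such an entry expands on different machines. It only mimics the variable expansion and is not how the automounter itself is implemented; the host and architecture values are invented.

```python
from string import Template
from typing import Mapping


def expand_map_entry(location: str, variables: Mapping[str, str]) -> str:
    """Expand ${NAME} references in a map location; raises KeyError if one is unset."""
    return Template(location).substitute(variables)


if __name__ == "__main__":
    entry = "${HOST}:/tool/${ARCH}"
    # Same mount point (/tool) on every client, different backing location.
    print(expand_map_entry(entry, {"HOST": "fs1", "ARCH": "sparc"}))
    print(expand_map_entry(entry, {"HOST": "fs2", "ARCH": "x86"}))
```

Pointing one test machine at a replica is then just a matter of giving it a different HOST value.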
Graph theory still applies.
Reg |
|
Valentin Nechayev *nix forums beginner
Joined: 30 Apr 2003
Posts: 14
|
Posted: Sun Apr 30, 2006 6:51 am Post subject:
Re: Whats the practical maximum file size using indexed allocation (I nodes)
|
|
|
Mon, Feb 27, 2006 at 05:11:43, random832 (Jordan Abel) wrote about "Whats the practical maximum file size using indexed allocation (I nodes)":
| Quote: | I won't disagree that binary compatibility is good - but off_t should
have been 64-bit to start with. On FreeBSD, off_t has NEVER been less
than 64 bits.
Actually, the API for FreeBSD V1.X had 32 bit lseek arguments. FreeBSD V2.X
had the proper (64bit) offset api, but wasn't fully implemented until about
V2.2.X... (I know, I wrote a lot of the lower level infrastructure.)
|
JA> How far back does the cvsweb go, in terms of what version of FreeBSD? I
JA> traced off_t through all the headers it was in, and it was never
JA> typedef'd to anything other than long long or int64_t.
Current FreeBSD CVS repository doesn't contain code for versions
before 2.0.0 due to licensing reasons (BSD<->AT&T suit; FreeBSD 1.*
was built on Net/2, while FreeBSD 2.0 was built on Lite1).
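As a reminder of why the 32-bit versus 64-bit offset argument quoted above matters for maximum file size, here is the arithmetic as a short Python sketch, assuming a signed offset type (which is what POSIX requires of off_t):

```python
def max_file_offset(bits: int) -> int:
    """Largest offset representable by a signed integer of the given width."""
    if bits < 2:
        raise ValueError("need at least 2 bits for a signed type")
    return 2 ** (bits - 1) - 1


if __name__ == "__main__":
    print(max_file_offset(32))  # 2147483647, just under 2 GiB
    print(max_file_offset(64))  # 9223372036854775807, just under 8 EiB
```

With 32-bit lseek arguments no file position past 2 GiB can be expressed, whatever the on-disk inode structure allows.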
-netch- |
|