[openbeosstorage] Re: DiskDevice API 2.x, Kernelland Draft
- From: "Ingo Weinhold" <bonefish@xxxxxxxxxxxxxxx>
- To: openbeosstorage@xxxxxxxxxxxxx
- Date: Fri, 06 Jun 2003 22:59:24 +0200 CEST
On Thu, 05 Jun 2003 23:54:50 -0700 Tyler Dauwalder <tyler@xxxxxxxxxxxxx
> wrote:
> > > Here's comments on the discussion. I'll try to address the inline
> > > questions and comments in the headers tomorrow.
> >
> > That'd be great. :-)
>
> Sorry, I didn't get home until late tonight and I'm beat after
> replying
> to this behemoth. Tomorrow. :-)
Fair enough. :-)
[...]
> > > > > As written in KDiskDeviceJob.h, I'm not sure if we want to
> > > > > add a
> > > > > Cancel() method to cancel a job in progress. Scheduled jobs
> > > > > can
> > > > > easily
> > > > > be canceled by removing them from the job queue. I just
> > > > > realize,
> > > > > that
> > > > > we don't provide any methods in the userland API for
> > > > > canceling a
> > > > > job.
> > > > Yes, I think we should support cancel, for things like
> > > > defragging
> > > > that
> > > > can be interrupted. But we should also allow the job to specify
> > > > if
> > > > it
> > > > allows cancelling or not.
> >
> > Don't know, if that's necessary. Jobs, that can't be cancelled,
> > will
> > simply finish their task.
>
> So are you proposing we just always display cancellation as an
> option,
> and if the task doesn't support it, things just keep quietly plugging
> away when you click the "Cancel" button? I think we should offer the
> user some indication of which tasks can be canceled gracefully and
> which can't.
OK, why not. So, how about adding a uint32
BDiskDeviceJob::CancelProperties(), which would return a combination of
these:
enum {
B_DISK_DEVICE_JOB_CAN_CANCEL = 0x01,
B_DISK_DEVICE_JOB_STOP_ON_CANCEL = 0x02,
B_DISK_DEVICE_JOB_REVERSE_ON_CANCEL = 0x04,
};
The cancel method would have parameters `bool force' and `bool
reverse'. It wouldn't have any effect, when
B_DISK_DEVICE_JOB_CAN_CANCEL is not set. If `force' is true the
`reverse' parameter is ignored and the job is told to quit as soon as
possible, even, if that means, that the partition will be left in an
invalid state. Otherwise `reverse' indicates whether the effect of the
job shall be undone.
CancelProperties() would be live. It could change during the execution
of the job (e.g. repairing might have a read-only analysis phase that
can simply be canceled, while canceling the repairing phase might leave
the partition in an invalid state).
> > > We can cancel all jobs - there would pop up a requester which
> > > says
> > > "Canceling the operation will recreate the initial state - this
> > > can
> > > take a while", and just reverse the thing.
> > > Shouldn't be too hard, at least not for moving a partition
> > > around.
> >
> > Yep, for that task it would work. OTOH, something like initializing
> > a
> > partition is not so easy to be undone. ;-) Even more fun it will
> > be,
> > when several jobs are in progress in parallel.
>
> Okay, I can see supporting reversing for the currently running task
> only, even in the case of multiple simultaneous tasks (they *are*
> disjoint by definition, so it shouldn't matter if they're running
> simultaneously or not
Yep, that matters only in case of reversing more than one job, since
the second of two currently executing job queues might work on a
partition modified by the first one before. So, one would need to undo
the second queue's jobs first and then the ones of the first one. Not a
big problem, but it adds to the fun, especially because the question
has to be answered, whether to undo only currently executing job queues
or also the very one just finished, having been started even after the
currently active ones.
>), but I don't know that we can expect every
> implementation to be so thorough as to implement graceful
> cancellation,
> let alone reversing, for every operation. I'd like a way to know
> whether an operation is uninterruptible, cancellable, or reversible,
> and then indicate this through the UI somehow.
I think, my proposal above provides that functionality. :-)
[...]
On Fri, 06 Jun 2003 16:05:08 +0200 CEST "Axel Dörfler" <axeld@pinc-
software.de> wrote:
> "Ingo Weinhold" <bonefish@xxxxxxxxxxxxxxx> wrote:
> > On Thu, 05 Jun 2003 12:56:21 +0200 CEST "Axel Dörfler" <axeld@pinc-
> > software.de> wrote:
> > > First of all, I haven't had a deep look at it yet, because I
> > > didn't
> > > find the time. But I somehow wanted to answer this one now :)
> > Hehe. :-)
> > You know, of course, that you can't sneak out of having a look at
> > the
> > kernel stuff I proposed -- being the main kernel developer, not to
> > mention team lead. ;-)
>
> Uh oh, okay, I'll try ;-)
Thanks. :-)
[...]
> > > Having those shadow partitions (although I don't like the name
> > > much,
> > > something like "target partitions" would make clearer that the
> > > current partitions should be changed
> > As I said, I was lacking a better name... :-)
> > I find `target partition' a bit general/vague, though.
>
> At least it does point in a direction ;-) target_partition might be a
> bit vague, but I really wouldn't understand shadow_partition, if I'd
> not knew about it.
Damn, that means, we have to write documentation. ;-)
> But anyway, do I understand the procedure correctly?
> First, you'd need to get all current partitions on a disk, but I
> somehow don't see how this would be possible using that API? Where do
> I
> get those partition IDs from?
Do you mean how the userland API gets them, or the kernel? For the
latter, that is trivial, since it generates them -- for any partition
found on a disk and any newly created one. For the userland API it is
not difficult either. One can iterate through the disk devices via
get_next_disk_device_id() and retrieve a complete info using
get_disk_device_data(), including all partitions.
> Currently, the standard way to iterate over all disk devices is to
> iterate over all entries under the /dev/disk path. But that will
> return
> all disks, if partitioned or not, even if present or not (in the case
> of removable media). If you got a device you'd ask for the partitions
> using ioctl().
If you're referring to B_GET_PARTITION_INFO, given a virtual device
published for a partition it simply returns what you (or someone else)
told it before via B_SET_PARTITION. There's no intelligence behind
this.
The disk device manager when being initialized will indeed iterate
through all /dev/disk/.../raw devices and creates a KDiskDevice for
each of them (thereafter it receives notifications for addition/removal
of devices as well as media changes). For the devices with a media
present, all disk systems are asked to identify its contents. The one
returning the highest priority is told to scan the partition. It
retrieves all relevant data and, in case of a partitioning system,
creates child KPartitions, for each of which the identifying/scanning
process is run again. On creation of a KPartition object (or when
making it known to the disk device manager) a fresh ID is assigned to
the partition, which it keeps until it is destroyed.
The IDs are not persistent, they are valid only during a session.
> So, assuming we somehow got to those partition IDs, would it be
> correct
> to do this:
>
> prepare_disk_device_modifications(device);
> // this will lock the disk device API (or just for this device, if
> possible)
It makes subsequent calls to prepare_disk_device_modifications() on the
same disk device fail for any other API user, therefore disabling them
to use any of the modification functionality provided.
> // (or even several for a software RAID)
I didn't think too hard about RAID and combining of several partitions
to a bigger one, but the disk device manager would definitely need to
have knowledge about it and do the necessary locking.
> defragment_partition(partition);
> resize_partition(partition, 100*1024*1024);
> // this will add the jobs to the job list
More precisely, this will modify the shadow partition for `partition'
(i.e. setting the defragmentation flag and the size attribute of the
respective KPartition).
> commit_disk_device_modifications(device, ...);
> // this will finally trigger the modifications to be made
> // and unlock the API - or will it first process all jobs
> // and the unlock the API?
It will compare the original with the shadow hierarchy and create and
schedule the jobs needed to transform the original hierarchy
respectively. The shadow hierarchy is deleted, and all partitions that
might be affected by any scheduled job are marked `busy', all ancestors
of `busy' partitions are marked `descendant-busy'. When the syscall
returns, a subsequent prepare_disk_device_modifications() on that disk
device will succeed -- only partitions that are not marked `*busy' are
allowed to be modified then, though.
> Where do we need the shadow partition name anyway? It doesn't seem to
> be part of the user API.
You mean KPartition::Name()? If one sets the name of a partition via
the userland API, it should also be reflected in the partition data, I
think. We currently don't have the functionality to set the name of a
partition in the userland API. We may want to add it, though. Maybe we
also want to discriminate between Name() and ContentName() (e.g.
partition name vs. FS/volume name).
> Also, the real user API is the C++ API, right?
Yep.
> So the user will never
> come across all those user_ prefixes - because if he would, I would
> consider dumping them.
I introduced the prefix for partition_data and disk_device_data,
because there're equally named structures for the partition module/FS
add-on interfaces and KPartition/KDiskDevice will probably have to deal
with both. For consistency I added he prefix to the other userland
structures as well.
> And if we have all these int32 IDs we could think about adding a new
> type for them, like partition_id.
Good idea. What about IDs for disk systems and disk device jobs?
disk_system_id, disk_device_job_id? Too long?
[...]
> > Regarding fs_initialize_volume(), it would be at best a convenience
> > function, nothing more. To reply to your arguments: Your first
> > paragraph just doesn't apply. Both calls initialize_partition() and
> > fs_initialize_volume() end at the same FS hook, which gets a
> > (partition) device path.
>
> What I meant was that the partition must exist at this point in time;
> if it does, then there is no problem.
Yep, it does.
> > The latter method is not more direct than the former one. Well,
> > more
> > convenient, if you mean that, but not more direct with respect to
> > the
> > functions involved. The call cannot be directly passed to the FS,
> > but
> > has to go through the disk device manager, since it must be
> > coordinated
> > with other operations on the disk device. Now things get a bit
> > difficult, for fs_initialize_volume() is synchronous, while the
> > disk
> > device jobs aren't. Moreover the disk device could be locked by a
> > userland API user, so that the thread couldn't even get the job
> > scheduled -- it could simply fail in this case, though.
> >
> > As I mentioned, creating file system in file could by addressed by
> > providing an API to register files as disk devices. A worthwile
> > thing
> > to do, I think, since that would even allow to initialize
> > partitioning
> > systems.
> >
> > To sum it up, I would see fs_initialize_volume() mainly as a
> > concenience function for one purpose, to create FSs in files. I
> > would
> > discourage application on partition devices (how did the caller get
> > hold of the partition device path, anyway?).
>
> Okay, you got me thinking :-) And I am also unsure about the
> justification of my previous rant :)
> What I completely missed was the fact that it really makes sense to
> direct any calls to fs_initialize_volume() through the disk device
> manager. It wouldn't be necessary to do so, because as long as there
> is
> a device (in /dev/disk/...), *anybody* can write to it.
It should be discouraged to do that, though. The reason we are writing
this nice DiskDevice API for is to provide convenient and consistent
access to disk devices and partitions. Everyone using the devices
directly is a party-pooper in my eyes. ;-)
> But of course, direct access could be dangerous in this case, so
> directing that call through the disk device manager would add some
> value.
Definitely.
> What I also don't understand with this API is how to create a new
> partition?
BPartition::CreateChild(), being mapped to the create_child_partition()
syscall.
> And if you want initialize_partition() to accept a file
> system and a partitioning system at the same place they had to share
> the same namespace, right?
Yes. Shouldn't be to bad a problem in my opinion, though.
> Is there a way for a user application to differentiate between the
> two?
BDiskSystem::IsFileSystem()/IsPartitioningSystem(). I just see, there's
no way to get a BDiskSystem for a name yet (other then iterating
through all disk system). Maybe we should add a
BDiskDeviceRoster::GetDiskSystem(const char *name, BDiskSystem *
diskSystem).
> I would be a little bit surprised if our general "mkfs" would create
> a
> partition on the disk.
That can be avoided. No problem. Although one could as well rename mkfs
to mkds. :-P
> We might also add a "name" field to initialize_partition() since
> almost
> all disk systems share this property (IIRC only Intel style
> partitioning doesn't have it).
If we do that, we should also make the functionality to set the name
explicit (or even `name' and `content name').
> I think I would find it cleaner if there
> are two functions to do that, anyway, even if embedded in the disk
> device manager context. For example, resizing a partition needs *two*
>
> resize calls internally, one to the partition, and one to the file
> system which is on that partition.
I actually find the disk system abstraction very suitable. Regarding
the handling of resizing a partition there is no difference between
file and partitioning systems. If in your example the partition does
not contain a FS but a nested partitioning system, exactly the same has
to be done -- the partition has to be resized and also its contents.
> OTOH I think it would be nice to have an add_partition() function -
> if
> we can do resize_partition(), why shouldn't we be able to do this?
> What steps would be involved to create 4 partitions on a given
> (empty)
> disk? I would like something as (may not fit perfectly in the
> proposed
> API, though):
>
> prepare_disk_device_modifications(device);
> initialize_disk_system(device);
> add_partition(device, "first", 100*1024*1024, "active=true");
> add_partition(device, "second", 50*1024*1024, NULL);
> ...
> commit_disk_device_modifications(device);
If you rename add_partition() to create_child_partition() that is quite
close to how it is intended to work. :-)
> As you mentioned, the only problem which we would still have is with
> the image files. Though I would really like us to be able to simply
> register/publish them as a device, it would also be nice to create a
> file system on them without having to register them.
> For example, that would be the use for the fs_shell, but perhaps also
> other things, although I don't have a good idea right now.
> Or to say it this way: I like to have no real differentiation between
> block devices and files - they are almost the same from the OS point
> of
> view, why shouldn't we keep this? Registering a file as device would
> be
> certainly a step into the right direction - but it would be even
> nicer
> if that wouldn't be necessary at all.
> We would also need to differentiate between file system images and
> disk
> images - but I guess we're doing this anyway.
>
> Well, I think it would be okay to not consider the fs_shell case, but
> to have mkfs register files automatically - the only thing we should
> support, though, is that we should continue to be able to address
> this
> file using its standard path. I.e. something like:
> $ mkfs test.image
> $ makebootable test.image
>
> Shouldn't fail even if "makebootable" needs a device to operate on.
> Dunno how we should do that right now, though. Maybe it's not even a
> top priority - but it'd be nice to have it.
OK, then how about the following additions to BDiskDeviceRoster?
partition_id RegisterDeviceFile(const char *filename);
// publishes: /dev/disk/virtual/files/<disk device ID>/raw
status_t UnregisterDeviceFile(const char *filename);
status_t UnregisterDeviceFile(partition_id device);
partition_id GetDeviceForPath(const char *filename, BDiskDevice *
device,
bool registerIfFile =
false);
partition_id GetPartitionForPath(const char *filename, BDiskDevice *
device,
BPartition
**partition,
bool registerIfFile = false);
mkfs (or fs_initialize_volume(), if we implement it using the
DiskDevice API) would only have to call one method to get hold of the
BPartition in questions. So, that should be rather convenient, I
suppose.
makebootable could get the BPartition the same way, check whether it
can deal with that FS, and get the path of the partition via
BPartition::GetPath() (well, it could as well use the supplied path).
It then opens the path and does, what it has to do. makebootable is a
bit tricky anyway, since what it does certainly depends on the
processor, the FS, and the partitioning scheme used.
[...]
> > > How ugly it gets for resizing a partition would be the job of the
> > > file system to judge on. But since we already need to have
> > > logging
> > > for those jobs, reversing the operation at any point shouldn't be
> > > impossible (or too ugly) at all.
> > We need logging for those jobs? I planned to gracefully leave that
> > out
> > for R1. :-P
>
> Well, I would also say that we don't implement any logging for this
> stuff in R1. *But* the API shouldn't reflect this in any way, I
> guess.
> It should just make sure that the user will have all the information
> he
> needs - like "Pressing cancel will destroy all data on disk" vs.
> "Pressing cancel will restore the initial state of the currently
> processed job" :-)
OK, that can be done.
> Tyler Dauwalder <tyler@xxxxxxxxxxxxx> wrote:
> > Yes. I just think it's clearer for someone who's writing their
> > first
> > fs/partition add-on if there's an explicit function there for each
> > operation rather than making them dig thru documentation or headers
> > elsewhere to figure out what operations they can/should support,
> > and
> > how they should support them. One could multiplex the entire fs add
> > -
> > on
> > suite thru a single function, but I think that's a much less
> > attractive
> > way to do it than the current setup.
> >
> > So really, I actually prefer the no multiplexing route in both
> > cases;
> > I
> > just think it's tolerable in the syscalls, since we're basically
> > the
> > only ones who'll be using them, and I can see the argument for not
> > polluting the syscall namespace.
Apropos syscall namespace, has a decision been made, what it shall look
like: sys_*() or _k*_()?
> That would also be the only problem I see - but since *we* are
> creating
> all syscall names ourselves, and we have beautiful names such as
> "prepare_disk_device_modifications" I can hardly think of any other
> use
> for that name :-)
> Also, having them as separate calls increases the possibilities to
> check their arguments.
At least it would be easier for the kernel side of the syscall.
BTW, how does one check, if a passed buffer exists and has indeed the
size the caller claims? Is there some convenient check function?
> > > As I said, I was lacking a better name... :-)
> > > I find `target partition' a bit general/vague, though.
> > I actually really like the name "shadow partitions". I think it
> > does
> > a
> > pretty good job of capturing the idea of what's going on. :-)
> >
> > Other ideas:
> >
> > CreateWorkingPartition()
> > CreateEditablePartition()
> > CreateWorkingCopyPartition()
> > CreateEditableCopyPartition()
> > ...
> >
> > I like shadow partitions just fine, though. :-)
>
> Well, if you insist on it... :-)
From the names heard so far, I would also favor the current one. :-)
[...]
> > > To sum it up, I would see fs_initialize_volume() mainly as a
> > > concenience function for one purpose, to create FSs in files. I
> > > would
> > > discourage application on partition devices (how did the caller
> > > get
> > > hold of the partition device path, anyway?).
> > I agree.
>
> Well, the standard method to get a device is by looking in the /dev
> path - and this is perfectly legal, we can't do anything against
> this,
> nor should we.
> OTOH we should make sure, that at least our API plays nicely
> together.
Who manually writes to any /dev/disk/... device should be very sure to
know what they are doing. (For a multi-user system, only the superuser
should be allowed to do that, I think.)
> > > We need logging for those jobs? I planned to gracefully leave
> > > that
> > > out
> > > for R1. :-P
> > Yes, let's get this clear. First I thought we were leaving it out
> > for
> > R1, then I thought it was in, then out again, and now... :-) Which
> > is
> > it?
>
> Logging possibility on file system level should always be there, its
> implementation doesn't have to be, though (even in R2).
> The partition resizing stuff itself doesn't have to be logged for R1,
> IIRC :)
Seems we do all agree. :-)
> > > > It could be nice to have this, and it would also be a bit
> > > > faster.
> > > > I
> > > > am not sure if it's rendering the API inconsistent there,
> > > > though.
> > > > One
> > > > could argue that the partition related functions (not
> > > > initialize())
> > > > are a bit separated from the rest anyway.
> > > Either way would work for me. So, if you have any preferences...
> > I prefer passing the device. :-)
>
> Okay, so let's do that, then :-)
So, then it's a FD plus the ID of the partition (something is needed to
identify it), I suppose.
CU, Ingo
- Follow-Ups:
- [openbeosstorage] Re: DiskDevice API 2.x, Kernelland Draft
- From: Tyler Dauwalder
- References:
- [openbeosstorage] Re: DiskDevice API 2.x, Kernelland Draft
- From: Tyler Dauwalder
Other related posts:
- » [openbeosstorage] DiskDevice API 2.x, Kernelland Draft
- » [openbeosstorage] Re: DiskDevice API 2.x, Kernelland Draft
- » [openbeosstorage] Re: DiskDevice API 2.x, Kernelland Draft
- » [openbeosstorage] Re: DiskDevice API 2.x, Kernelland Draft
- » [openbeosstorage] Re: DiskDevice API 2.x, Kernelland Draft
- » [openbeosstorage] Re: DiskDevice API 2.x, Kernelland Draft
- » [openbeosstorage] Re: DiskDevice API 2.x, Kernelland Draft
- » [openbeosstorage] Re: DiskDevice API 2.x, Kernelland Draft
- » [openbeosstorage] Re: DiskDevice API 2.x, Kernelland Draft
- » [openbeosstorage] Re: DiskDevice API 2.x, Kernelland Draft
- » [openbeosstorage] Re: DiskDevice API 2.x, Kernelland Draft
- [openbeosstorage] Re: DiskDevice API 2.x, Kernelland Draft
- From: Tyler Dauwalder
- [openbeosstorage] Re: DiskDevice API 2.x, Kernelland Draft
- From: Tyler Dauwalder