Orange reads and writes files in a number of different formats.
Loading examples is trivial. To load the data from the file iris.tab, you simply pass the file name to ExampleTable, as in orange.ExampleTable("iris.tab").
You can, of course, also give a relative or absolute path to the file. ExampleTable is even smarter than that: you can omit the file extension and give only the stem, "iris".
This will do the same: Orange will look for any file with the stem "iris" and a recognizable extension, such as .tab or .names. If Orange discovers that you have, for instance, both iris.tab and iris.names in the current directory, it will issue an error. You will have to provide the extension in this case.
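The lookup described above can be modelled in a few lines of plain Python. This is only a sketch of the documented behaviour, not Orange's own code: the function name resolve_stem is hypothetical, and the list of extensions is an assumption limited to ones mentioned on this page.

```python
import os

# Assumption: an illustrative subset; the extensions Orange recognizes may differ.
KNOWN_EXTENSIONS = (".tab", ".names", ".basket")


def resolve_stem(stem, directory="."):
    """Return the single file in `directory` named `stem` plus a known extension."""
    candidates = [
        stem + ext
        for ext in KNOWN_EXTENSIONS
        if os.path.isfile(os.path.join(directory, stem + ext))
    ]
    if not candidates:
        raise IOError("no file with stem %r and a known extension" % stem)
    if len(candidates) > 1:
        # Mirrors the documented error: the caller must supply the extension.
        raise IOError("ambiguous stem %r: %s" % (stem, ", ".join(candidates)))
    return os.path.join(directory, candidates[0])
```

With only iris.tab present, the sketch returns its path; once iris.names sits next to it, the call fails and the extension has to be given explicitly.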
When reading Excel files with multiple worksheets, the active worksheet (the one which was visible when the file was last saved) is read. To override this, you can specify a worksheet by appending a "#" and the worksheet's name to the file name. If your iris file is in Excel's format in a worksheet named "train", you can read it by appending "#train" to the file name.
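To make the "#" convention concrete, here is a minimal sketch of how such a name could be taken apart. The helper split_worksheet and the file name iris.xls are hypothetical, and splitting at the first "#" is an assumption; Orange's own parsing may differ.

```python
def split_worksheet(filename):
    """Split 'path#sheet' into (path, sheet); sheet is None when no '#' is given."""
    # Assumption: the first '#' separates the file name from the worksheet name.
    path, separator, sheet = filename.partition("#")
    return path, (sheet if separator else None)


print(split_worksheet("iris.xls#train"))
print(split_worksheet("iris.xls"))
```

A result of None for the worksheet corresponds to the default case, in which the active worksheet is read.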
What follows is a bit more complicated, so beginners may want to skip it.
There's a slight complication, which occurs when you have, for instance, separate files for training and testing examples.
Orange distinguishes between attributes based on descriptors, not names. That is to say, you can have two different attributes with the same name. If one is used for learning, the classifier would not recognize the second attribute as the same attribute when testing. When loading the data, Orange thus examines a list of all existing attributes and reuses them whenever possible, that is, when there exists an attribute with the same name and type and, in case of discrete attributes, the same order of values. The order of values only matters when it is explicitly set in the data file (e.g. in the second line of a tab-delimited file).
Only two special cases require user intervention. The first occurs when the user accidentally specifies a different order of values for the same attribute in two files, or first loads a data file without any prescribed order (so the values are sorted alphabetically) and then loads a file with a different prescribed order. The consequences are bad: any classifier trained on one data set and used on the other would treat the attribute values as missing. This is, however, a user's mistake which is hard to prevent.
The opposite case is more common: the user loads two different data sets with two different attributes which accidentally have the same name and no conflict in the order of values. The attribute is reused, so the list of values now contains the values for both attributes. For instance, the first data set has an attribute a with values 1 and 2, and the other has an attribute a with values yes and no. Without the order being prescribed, we get a common attribute a with values 1, 2, yes and no. This is, however, difficult to miss and simple to correct.
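The effect of this accidental reuse can be illustrated with plain lists. The sketch below is a model of the outcome described in the text, not Orange code; the function reuse_values is hypothetical and assumes that unseen values are appended after the existing ones.

```python
def reuse_values(existing, new):
    """Value list of a reused attribute: existing values, then unseen new ones."""
    merged = list(existing)
    for value in new:
        if value not in merged:
            merged.append(value)
    return merged


# Attribute "a" from the first file, then "a" from the second file.
print(reuse_values(["1", "2"], ["yes", "no"]))  # ['1', '2', 'yes', 'no']
```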
orange.ExampleTable has an additional flag that tells it when to construct new attributes. By default (orange.Variable.MakeStatus.Incompatible), new attributes are constructed only when there exists no attribute with the same name and type, or when the existing attribute's order of values is incompatible with the order for the new attribute. One can decide not to reuse the attribute if the two attributes have no common values (orange.Variable.MakeStatus.NoRecognizedValues), which would help in the above case. To be even stricter, one can require the old attribute to have all the values of the new one (orange.Variable.MakeStatus.MissingValues). Finally, all attributes will be constructed anew if orange.Variable.MakeStatus.OK is used.
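One way to read the four settings is as a threshold on how poorly the existing attribute matches the new one. The following sketch models that reading in plain Python. It is not Orange's implementation: the numeric ordering of the statuses, the function names, and the order_conflict parameter (which stands in for the actual order-compatibility check) are all assumptions made for illustration.

```python
# Assumption: statuses are ordered from best match (OK) to worst (NOT_FOUND).
OK, MISSING_VALUES, NO_RECOGNIZED_VALUES, INCOMPATIBLE, NOT_FOUND = range(5)


def match_status(existing, new, order_conflict=False):
    """Classify how well the values of an existing attribute match the new ones.

    `existing` is None when no attribute with the same name and type exists.
    """
    if existing is None:
        return NOT_FOUND
    if order_conflict:
        return INCOMPATIBLE
    if not set(existing) & set(new):
        return NO_RECOGNIZED_VALUES
    if set(new) - set(existing):
        return MISSING_VALUES  # the old attribute lacks some of the new values
    return OK


def creates_new(status, create_new_on=INCOMPATIBLE):
    """True if a new attribute is constructed instead of reusing the old one."""
    return status >= create_new_on


status = match_status(["1", "2"], ["yes", "no"])
print(creates_new(status))                        # default flag: reused
print(creates_new(status, NO_RECOGNIZED_VALUES))  # stricter flag: new attribute
```

Under these assumptions the stricter flag separates the two unrelated attributes named a from the example above, while the default flag merges them.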
After loading, the ExampleTable instance has an attribute attributeLoadStatus, which describes, for each attribute, the status of the attribute found among the existing attributes. So, if attributeLoadStatus[3] equals Variable.MakeStatus.Incompatible, the fourth attribute was not reused since the existing attribute of the same name had an incompatible order of values. If attributeLoadStatus[4] equals Variable.MakeStatus.MissingValues, the candidate for reuse lacked some of the values of the fifth attribute; whether it has been reused or we have a new attribute depends upon the flag passed to ExampleTable. The same information for meta attributes is provided in a dictionary metaAttributeLoadStatus, where the key is a meta-attribute id and the value is the corresponding status.
Loading files essentially uses the function Variable.make for creating attributes. A detailed description of the statuses is available in the documentation of that function.
Note: in the past, these things were handled through domain depots and special arguments to ExampleTable by which the user could give a list of domains and/or attributes to reuse. The system was complicated and unclear, so it has been abandoned. Domain depots still exist, yet we are considering removing them, too.
Saving examples to files is even simpler. ExampleTable (actually its ancestor, ExampleGenerator) implements a method save which accepts a single argument, the file name, from which it also deduces the format. For instance, if examples are stored in an ExampleTable named data, we can save them as a tab-delimited file by calling save with a file name that ends in .tab, for instance data.save("mydata.tab").
If the file format requires multiple files (such as C4.5, where the attribute definitions and the examples are separate files with different extensions), specify one of them and Orange will take care of the rest. For instance, to save the data in C4.5 format, pass either mydata.names or mydata.data to save; Orange writes both mydata.names and mydata.data in both cases.
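The pairing of the two C4.5 files can be sketched as follows. The helper c45_files is hypothetical and only models the naming rule stated above; it does not write anything and is not part of Orange.

```python
import os


def c45_files(filename):
    """Return the (names, data) file pair implied by either C4.5 file name."""
    stem, ext = os.path.splitext(filename)
    if ext not in (".names", ".data"):
        raise ValueError("expected a .names or .data file name, got %r" % filename)
    return stem + ".names", stem + ".data"


print(c45_files("mydata.names"))  # ('mydata.names', 'mydata.data')
print(c45_files("mydata.data"))   # ('mydata.names', 'mydata.data')
```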
Some formats are not able to store all types of data sets. For instance, C4.5 can only store examples where the outcome is discrete.
Can I save the files using some other extension than the default? Yes, for each file format there is a separate saving function (saveC45, saveTabDelimited, saveBasket, saveCsv or saveTxt), which accepts two arguments, a filename and the examples. The function will decorate the file name according to the format's customs. One-file formats (such as tab-delimited or basket) will be given the default extension (such as .tab or .basket) unless you provide one. For C4.5, the domain definition files will be given the extension .names, while data files will be given the standard extension (.data) only if the filename that you've given as argument doesn't have an extension (i.e., doesn't have any dots to the right of the last path separator). These functions are otherwise rather obsolete, so you should avoid them when possible.
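The decoration rule for one-file formats amounts to "add the default extension only if none was given". A minimal sketch, assuming that an extension is detected the way Python's os.path.splitext does it; the helpers has_extension and decorate are hypothetical and are not the Orange functions named above.

```python
import os


def has_extension(filename):
    """True if the last path component has an extension (leading dots do not count)."""
    return os.path.splitext(filename)[1] != ""


def decorate(filename, default_ext):
    """Append `default_ext` unless `filename` already carries an extension."""
    return filename if has_extension(filename) else filename + default_ext


print(decorate("results/v1.2/iris", ".tab"))  # results/v1.2/iris.tab
print(decorate("iris.txt", ".tab"))           # iris.txt
```

The first call shows why only the part after the last path separator matters: the dot in the directory name v1.2 does not count as an extension.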