File Structure and Size (For Simple Data Files)
The simplest data files can be imported into many different kinds of
analysis programs. They are easy to generate and highly versatile, so
it is worth understanding how they are structured. Here is an example
of a data file.
0.01 33.19
0.02 33.45
0.03 35.70
Note the following:
-
Point 1:
The data in the file consists of characters - not numbers.
As a human being, you
naturally interpret the characters in the file as numbers, but what the
file actually stores is characters.
-
In the data above, the
first character is a "0".
-
In the data above, the
second character is a "." (a period).
-
etc.
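Point 1 can be illustrated with a minimal Python sketch. It takes the first field of the example file as a string and shows that it is four separate characters, each with its own ASCII code, and that it becomes a number only after an explicit conversion. The variable names are illustrative only.

```python
# The first field of the example file, exactly as stored: four characters.
text = "0.01"

# Split the field into its individual characters.
chars = list(text)               # ['0', '.', '0', '1']

# Look up the ASCII code of each character.
codes = [ord(c) for c in text]   # [48, 46, 48, 49]

# The characters become a number only when a program converts them.
value = float(text)

print(chars, codes, value)
```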
Then:
-
Point 2:
Each character is stored as a single ASCII code, which takes a
single byte of digital storage.
In a data file, the data
is usually interpreted as being in rows and columns. There are special
characters - called delimiters - that are used to separate columns.
-
Point 3:
Delimiters used to separate columns are usually one of the following.
-
Tab characters
(number 9 in the ASCII table). Note that HTML does not preserve tab
characters, so copying file data into an HTML page (as in the example above)
replaces tabs with spaces. Files using tabs as delimiters are called
tab-delimited files.
-
Commas
(number 44 in the ASCII table). Files using commas as delimiters are
called comma-delimited files.
-
Point 4:
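The end of a row in a tab-delimited or a comma-delimited file is usually
marked by two characters. (This is the Windows convention. Files written
on Unix-like systems typically end each row with the line feed alone.)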
-
A Carriage Return
(Number 13 in the ASCII table - denoted by '\r' in many programming languages
and by 'CR' in many others) and
-
A Line Feed
(Number 10 in the ASCII table - denoted by '\n' in many programming languages
and by 'LF' in many others)
-
Note that the CR + LF
combination is used in many other situations. For example, it is
common to find that combination at the end of a data string when an instrument
sends data to a computer. That's used to indicate the end of the
data string.
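Points 2 through 4 can be sketched in a few lines of Python. The sketch assumes the example file uses a tab between the columns and CR + LF at the end of every row, including the last. The variable names are illustrative only.

```python
# Assumed raw contents of the example file: tab-delimited, CR + LF row endings.
data = b"0.01\t33.19\r\n0.02\t33.45\r\n0.03\t35.70\r\n"

# Each ASCII character is one byte, and indexing a bytes object gives its code.
tab_code = data[4]    # the delimiter after "0.01"
cr_code = data[10]    # first end-of-row character
lf_code = data[11]    # second end-of-row character
print(tab_code, cr_code, lf_code)   # 9 13 10

# Split into rows on CR + LF, then split each row into columns on the tab.
rows = []
for line in data.decode("ascii").split("\r\n"):
    if line:                        # skip the empty string after the final CR + LF
        rows.append(line.split("\t"))

print(rows)
```

For a comma-delimited file the same approach applies with "," in place of "\t".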
When you encounter a file constructed as above, it is easy to calculate
the size of the file. In the file above:
-
There are four characters
in the "0.01" string - three digits and a period.
-
There is a tab character
between the two columns.
-
There are five characters
in the "33.19" string - four digits and a period.
-
There are two characters
at the end of the row - a carriage return and a line feed.
-
There are twelve characters
in each row.
-
There are three rows,
thus 36 total characters.
-
Each character is a byte,
so the file size should be 36 bytes.
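The arithmetic above can be checked with a short Python sketch. It assumes one tab between columns and CR + LF at the end of every row. The helper `row_size` is a hypothetical name introduced only for this illustration.

```python
# The three rows of the example file, as pairs of field strings.
rows = [("0.01", "33.19"), ("0.02", "33.45"), ("0.03", "35.70")]

DELIMITER = "\t"    # assumed: one tab between columns
ROW_END = "\r\n"    # assumed: carriage return + line feed after each row

def row_size(fields):
    """Count the characters in one row: fields, delimiters, and row ending."""
    field_chars = sum(len(f) for f in fields)
    delimiter_chars = len(DELIMITER) * (len(fields) - 1)
    return field_chars + delimiter_chars + len(ROW_END)

sizes = [row_size(r) for r in rows]
total = sum(sizes)   # one byte per ASCII character

print(sizes, total)  # [12, 12, 12] 36
```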
Exercise
If you are in Windows, open Notepad (or if you are in another operating
system, open a simple text editor) and type in the file above, pressing
Tab between the columns and Enter at the end of every row, including
the last. (If you copy and paste you will miss the tabs since they have
been replaced by spaces!) After saving the file, right-click the file
icon and determine the file size. It should match the prediction
above. (An editor that ends each row with a line feed alone will
produce 33 bytes rather than 36.)
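The same check can be made from a program instead of a text editor. The following is a minimal Python sketch, assuming tabs between columns and CR + LF row endings. The file name `data.txt` and the temporary directory are hypothetical choices for the illustration.

```python
import os
import tempfile

# Assumed contents: tab between columns, CR + LF at the end of every row.
content = "0.01\t33.19\r\n0.02\t33.45\r\n0.03\t35.70\r\n"

# Write in binary mode so Python does not translate the line endings.
path = os.path.join(tempfile.mkdtemp(), "data.txt")  # hypothetical file name
with open(path, "wb") as f:
    f.write(content.encode("ascii"))

# Ask the operating system for the file size in bytes.
size_crlf = os.path.getsize(path)

# For comparison: the same data with line-feed-only row endings.
size_lf = len(content.replace("\r\n", "\n").encode("ascii"))

print(size_crlf, size_lf)  # 36 33
```

The three-byte difference is the one carriage return saved per row when rows end in a line feed alone.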