Introduction to Java I/O
Have you ever wondered what the OutputStream and InputStream classes are? Let’s find out.
I/O stands for Input and Output. Java has a rich set of I/O classes in the core API — especially in the java.io package.
I/O in Java is divided into two categories:
- Byte- and number-oriented I/O, which is handled by input and output streams.
- Character and text I/O, which is handled by readers and writers.
Streams
A stream is an ordered sequence of bytes of undetermined length. There are two types of streams:
- Input streams: they move bytes of data into a Java program from some external source.
- Output streams: they move bytes of data from a Java program to some external target.
An input stream may read from a finite source of bytes, like a file, or from a potentially unbounded source of bytes, like System.in.
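To make this concrete, here is a minimal sketch of the usual read loop. The class name ReadUntilEnd and the helper countBytes are made up for illustration, and an in-memory byte array stands in for a real source such as a file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class ReadUntilEnd {
    // Counts the bytes in a stream by reading one byte at a time.
    static int countBytes(InputStream in) {
        int count = 0;
        try {
            // read() returns the next byte as an int from 0 to 255, or -1 at end of stream.
            while (in.read() != -1) {
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        // An in-memory array stands in for a file or network source.
        InputStream in = new ByteArrayInputStream(new byte[] {10, 20, 30});
        System.out.println(countBytes(in)); // prints 3
    }
}
```

The same loop would never finish on a source that keeps producing bytes, which is why the length of a stream is described as undetermined.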
Where Do Streams Come From?
Streams may come from various sources, including:
- System.in
- Files
- Network connections
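As a rough sketch of how these sources look in code, the example below reads from a file; System.in is already an InputStream, and a java.net.Socket hands one out through getInputStream(). The class StreamSources, its helper method, and the temporary file name are assumptions made for this illustration.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class StreamSources {
    // Writes the given bytes to a temporary file, then reads the first byte
    // back through a FileInputStream. Returns -1 if the file is empty.
    static int firstByteOfTempFile(byte[] contents) {
        try {
            Path file = Files.createTempFile("stream-demo", ".bin");
            try {
                Files.write(file, contents);
                try (InputStream in = new FileInputStream(file.toFile())) {
                    return in.read();
                }
            } finally {
                Files.deleteIfExists(file); // clean up the temporary file
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // System.in is a ready-made stream tied to standard input (not read here).
        InputStream console = System.in;
        System.out.println(firstByteOfTempFile(new byte[] {65, 66})); // prints 65
    }
}
```

Whatever the source, the program sees the same InputStream type, so the reading code does not need to change.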
The Stream Classes
There are two main stream classes:
- OutputStream
- InputStream
They are abstract base classes for many subclasses with more specialized abilities, including:
- BufferedInputStream
- ByteArrayInputStream
- DataInputStream
- FileInputStream
- FilterInputStream
- LineNumberInputStream (deprecated)
- ObjectOutputStream
- PipedOutputStream
- PushbackInputStream
- StringBufferInputStream (deprecated)
- BufferedOutputStream
- ByteArrayOutputStream
- DataOutputStream
- FileOutputStream
- FilterOutputStream
- ObjectInputStream
- PipedInputStream
- PrintStream
- SequenceInputStream
Input streams read bytes and output streams write bytes. Readers read characters and writers write characters.
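The subclasses are typically used by wrapping one stream in another. Below is a small sketch, assuming we want to push an int through a DataOutputStream into an in-memory ByteArrayOutputStream and read it back. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class StreamChaining {
    // Writes an int through a DataOutputStream and returns the raw bytes produced.
    static byte[] intToBytes(int value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(value); // four bytes, high byte first
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads the int back by wrapping the bytes in a DataInputStream.
    static int bytesToInt(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            return in.readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = intToBytes(1245);
        System.out.println(bytes.length);      // prints 4
        System.out.println(bytesToInt(bytes)); // prints 1245
    }
}
```

The underlying stream only ever sees bytes; the wrapper adds the knowledge of how an int maps onto them.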
To understand input and output streams we need a solid understanding of how Java deals with bytes, integers, characters and other primitive data types and when and why one is converted into another.
Integer Data
The most common integer data type in Java is the int, a 32-bit, big-endian, two’s complement integer. It takes values between -2,147,483,648 and 2,147,483,647.
Longs are 64-bit, big-endian, two’s complement integers. They take values between -9,223,372,036,854,775,808 and 9,223,372,036,854,775,807.
Shorts are 16-bit, big-endian, two’s complement integers that range from -32,768 to 32,767.
Bytes are 8-bit two’s complement integers that range from -128 to 127. Like the other integer types, a byte is signed.
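By default a literal like 1245 is an int. To convert it to a byte we need to truncate the higher order bits. A bitwise AND keeps only the low-order eight bits (the result is still an int, from 0 to 255), and a cast then turns it into a signed byte. Here value stands for any int:
byte b = (byte) (value & 0x000000ff);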
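Here is a short sketch of the ranges and of the truncation in action. The class ByteTruncation and its two helpers are names invented for this example.

```java
class ByteTruncation {
    // Keeps only the low-order eight bits; the result is still an int from 0 to 255.
    static int lowEightBits(int value) {
        return value & 0x000000ff;
    }

    // A narrowing cast also drops the high-order bits, but yields a signed byte.
    static byte toByte(int value) {
        return (byte) value;
    }

    public static void main(String[] args) {
        System.out.println(Integer.MAX_VALUE);  // prints 2147483647
        System.out.println(Byte.MIN_VALUE);     // prints -128
        System.out.println(lowEightBits(1245)); // prints 221
        System.out.println(toByte(1245));       // prints -35
    }
}
```

1245 is 0x04DD, so the low byte is 0xDD. Read as an unsigned value that is 221; read as a signed byte it is 221 - 256 = -35.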
Character Data
Computers only understand numbers.
When dealing with characters, we need a mapping between integers and characters. In ASCII, for example, the character Z is mapped to 90.
Different encodings have different mappings.
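In Java this mapping is visible directly, because a char converts to and from an int. A minimal sketch, with CharCodes as a made-up class name:

```java
class CharCodes {
    // A char widens to an int without a cast; the int is the character's numeric code.
    static int codeOf(char c) {
        return c;
    }

    // Going the other way needs an explicit narrowing cast.
    static char charOf(int code) {
        return (char) code;
    }

    public static void main(String[] args) {
        System.out.println(codeOf('Z')); // prints 90
        System.out.println(charOf(90));  // prints Z
    }
}
```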
ASCII
It is a seven-bit character set.
It defines 2⁷, or 128, different characters. These characters are sufficient for handling most American English text and can approximate many European languages.
ISO Latin-1
It is an eight-bit character set.
It defines 2⁸, or 256, characters. The first 128 characters correspond to ASCII; the characters from 128 to 255 are where it goes beyond ASCII.
It provides just enough characters to write most Western European languages.
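The difference between the two character sets shows up as soon as we encode a non-ASCII character. In this sketch (the class Latin1Demo is hypothetical) the letter é, U+00E9, fits in one ISO Latin-1 byte but has no ASCII equivalent, so String.getBytes substitutes a question mark.

```java
import java.nio.charset.StandardCharsets;

class Latin1Demo {
    static byte[] encodeLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    static byte[] encodeAscii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // the letter e with an acute accent

        // Mask with 0xff to print the byte as an unsigned value.
        System.out.println(encodeLatin1(eAcute)[0] & 0xff); // prints 233
        // ASCII cannot represent it, so getBytes falls back to '?' (63).
        System.out.println(encodeAscii(eAcute)[0]);         // prints 63
        // Characters below 128 get the same value in both character sets.
        System.out.println(encodeLatin1("Z")[0]);           // prints 90
    }
}
```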
Unicode
ISO Latin-1 suffices for most Western European languages but does not work for languages such as Greek, Arabic, Hebrew, or Persian.
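Unicode was originally designed as a 16-bit character set, with 2¹⁶, or 65,536, possible characters, and a Java char is still a 16-bit value. Unicode has since grown beyond 16 bits; Java represents the additional (supplementary) characters as pairs of chars.
The first 256 characters correspond to ISO Latin-1.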
You may have realized that byte streams are a poor fit for this: streams are designed to read one byte at a time, but a character here occupies two bytes. This is why we have readers and writers. Without them, you would have to read two bytes, multiply the first by 256, add the second, and cast the result to a char.
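The manual approach can be sketched as follows, assuming the two bytes arrive high byte first. ManualCharRead and readChar are names chosen for this example only.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class ManualCharRead {
    // Reads two bytes (high byte first) and combines them into one char.
    static char readChar(InputStream in) {
        try {
            int high = in.read(); // 0 to 255, or -1 at end of stream
            int low = in.read();
            if (high == -1 || low == -1) {
                throw new IllegalStateException("stream ended mid-character");
            }
            return (char) (high * 256 + low);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 0x03 0xA9 is the two-byte form of U+03A9, the Greek capital omega.
        InputStream in = new ByteArrayInputStream(new byte[] {0x03, (byte) 0xA9});
        System.out.println((int) readChar(in)); // prints 937
    }
}
```

This works, but it hard-codes a single encoding and byte order into the program.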
Readers handle the conversion of bytes in one character set to Java chars without any extra effort. For similar reasons, you should use a writer rather than an output stream to write text.
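A minimal sketch of the reader approach: an InputStreamReader wraps a byte stream and is told which character set the bytes are in. The class ReaderDemo and the helper decode are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReaderDemo {
    // Decodes bytes into a String by reading chars through an InputStreamReader.
    static String decode(byte[] bytes, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            // read() returns one char as an int, or -1 at end of stream.
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        byte[] twoByteOmega = {0x03, (byte) 0xA9}; // U+03A9 as two bytes, high byte first
        byte[] latin1EAcute = {(byte) 0xE9};       // U+00E9 as one ISO Latin-1 byte

        System.out.println((int) decode(twoByteOmega, StandardCharsets.UTF_16BE).charAt(0));   // prints 937
        System.out.println((int) decode(latin1EAcute, StandardCharsets.ISO_8859_1).charAt(0)); // prints 233
    }
}
```

The reading loop is identical for both inputs; only the Charset passed to the reader changes.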
UTF-8
A fixed two-byte encoding of Unicode is relatively inefficient when most of your text consists of ASCII characters. Every character requires the same number of bytes (two), even though some characters are used much more frequently than others. A more efficient encoding would use fewer bits for the more common characters. This is what UTF-8 does.
In UTF-8 the ASCII characters are encoded using a single byte, just as in ASCII. The next 1,920 characters (U+0080 through U+07FF) are encoded in two bytes. The remaining characters in the 16-bit range are encoded in three bytes, and supplementary characters beyond that range take four. However, since these longer characters are relatively uncommon, especially in English text, the savings achieved by encoding ASCII in a single byte more than make up for it.
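We can check these sizes by encoding a few sample strings. In this sketch the class Utf8Lengths and the helper utf8Length are illustrative names, and the last sample is a supplementary character beyond the 16-bit range, which UTF-8 encodes in four bytes.

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths {
    // Returns how many bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("Z"));            // prints 1 (ASCII)
        System.out.println(utf8Length("\u00e9"));       // prints 2 (U+00E9)
        System.out.println(utf8Length("\u20ac"));       // prints 3 (U+20AC, the euro sign)
        System.out.println(utf8Length("\uD83D\uDE00")); // prints 4 (U+1F600, stored as two chars)
    }
}
```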
I hope this explains to some extent what streams are and why readers and writers are needed.
We shall cover Output Streams next.