To Read Data From a File of an Unspecified Length an Eof-controlled Loop Is a Good Choice

Introduction

Of all the problems posted by beginner C++ and C programmers on sites like Reddit and Stack Overflow, surely the commonest is confusion over the treatment of the end-of-file condition when writing programs that either interact with the user or read input from files. My own guess is that over 95% of questions exhibit a complete misunderstanding of the end-of-file concept. This article is an attempt to explain all of the issues regarding this confused and confusing subject, with particular reference to C++ and C programmers using the Windows and Unix-like (such as Linux, which I'll use from now on as the exemplar for this OS type) operating systems.

The myth of the EOF character

The first problem that many beginners face when confronted with the end-of-file issue is that of the EOF character: basically, there isn't one, but people think there is. Neither the Windows nor the Linux operating systems have any concept of a marker character for indicating the end-of-file. If you create a text file using Notepad on Windows, or Vim on Linux, or any other editor/OS combination you fancy, that file will nowhere in it contain a special character marking the end-of-file. Both Windows and Linux have file systems that always know the exact length in bytes of the contents of a file, and have absolutely no need of any special character marking the file's end.
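As a rough illustration of that last point, the sketch below compares the length the file system reports with the number of bytes actually read from the file. It assumes a C++17 compiler (for std::filesystem); the function names and the helper write_text are hypothetical and exist only for this example.

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <string>

// Count the bytes in a file by reading them one at a time.
// Binary mode is used so that the library translates nothing.
std::uintmax_t count_bytes_by_reading(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::uintmax_t n = 0;
    char c;
    while (in.get(c)) {   // the read itself is the loop condition
        ++n;
    }
    return n;
}

// Ask the file system for the length it has recorded for the file.
std::uintmax_t size_from_file_system(const std::string& path) {
    return std::filesystem::file_size(path);
}

// Hypothetical helper: create a file holding exactly the given text.
void write_text(const std::string& path, const std::string& text) {
    std::ofstream out(path, std::ios::binary);
    out << text;
}   // the stream is flushed and closed here
```

If write_text is used to store the 11 characters of hello world, both functions should report 11: the file contains those bytes and nothing else, and the file system already knows the length without any marker character.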

So if neither Windows nor Linux use an EOF character, where does the idea come from? Well, long ago and far away, there was an operating system (to use the term loosely) called CP/M, which ran on 8-bit processors like the Zilog Z80 and Intel 8080. The CP/M file system did not know exactly how long a file was in bytes; it only knew how many disk blocks it took up. This meant that if you edited a small file containing the text hello world, CP/M would not know that the file was 11 bytes long; it would only know it took up a single disk block, which would be a minimum of 128 bytes. As people generally like to know how big their files appear to be, rather than the number of blocks they take up, an end-of-file character was needed. CP/M re-used the Control-Z character (decimal code 26, hex 1A, which ASCII designates SUB, for "substitute") from the ASCII character set for this purpose: when a CP/M application read a Control-Z character it would typically treat that read as though an end-of-file had occurred. There was nothing forcing applications to do this; apps that processed binary data would need some other means of knowing if they were at the end-of-file, and the OS itself did not treat Control-Z specially.

When MS-DOS came along, compatibility with CP/M was very important, as a lot of the first MS-DOS applications were simply CP/M apps that had been fed through mechanical translators that turned Z80/8080 machine code into 8086 machine code. As the applications were not re-written, they still treated Control-Z as the end-of-file marker, and some do to this very day. In fact, this treatment of Control-Z is built in to the Microsoft C Runtime Library, if a file is opened in text mode. It's important to reiterate that the Windows OS itself knows and cares nothing about Control-Z; this behaviour is purely down to the MS library, which unfortunately just about every Windows program uses. It's also important to realise that this is purely a Windows issue: Linux (and the other Unixes) have never used Control-Z (or anything else) as an end-of-file marker in any shape or form.

Some demo code

You can demonstrate this unfortunate feature of the MS libraries with this code. First, write a program that puts a Control-Z into a text file:

#include <iostream>
#include <fstream>
using namespace std;

int main() {
    ofstream ofs( "myfile.txt" );
    ofs << "line 1\n";
    ofs << char(26);    // Control-Z (ASCII code 26)
    ofs << "line 2\n";
}

If you compile and run this on either Windows or Linux, it will create a text file with an embedded Control-Z (ASCII code 26) between the two lines of text. On neither platform does Control-Z have any special meaning on output. You can now try to read the file using command line facilities. On Windows:

c:\users\neilb\home\temp>type myfile.txt
line 1

Notice that only the first line is displayed. On Linux:

[neilb@ophelia temp]$ cat myfile.txt
line 1
?line 2

Both lines are displayed, and a strange character (represented here by the question mark) is also displayed between them, as the cat command has simply read Control-Z like any other character and printed it out; exactly what gets displayed depends on your terminal software.

This might seem to indicate that the Windows OS does know about the Control-Z character, but that's not the case: only certain application code knows about it. If you open the file using the Windows Notepad utility, you will see this:

[Screenshot: myfile.txt open in Notepad, 2012-12-04]

Both lines of text are displayed, with the Control-Z character between them. Notepad does not know about Control-Z as an end-of-file marker at all.

Text versus binary mode

So what is the difference between the type command used above and the Notepad application? It's actually hard to say. Possibly the type command has some special code that checks for the Control-Z character in its input. However, Windows programmers using the C++ iostream library and the C stream library have the option of opening a file in either text mode or binary mode, and this will make a difference to what gets read.

Here's the canonical way to read a text file in C++:

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main() {
    ifstream ifs( "myfile.txt" );
    string line;
    while( getline( ifs, line ) ) {
        cout << line << '\n';
    }
}

If you compile and run this program on Windows, you will see that the Control-Z is treated as an end-of-file marker; the output is:

line 1

However, if you open the file in binary mode by making this change:

ifstream ifs( "myfile.txt", ios::binary );

the output is:

line 1
?line 2

The Control-Z character is only treated as being special in text mode (the default); in binary mode, it is treated just like any other character. Note that this is only true for Windows; on Linux both programs behave in exactly the same way.

So what to do? There are two things to remember:

  • If you want your files to open portably in text mode, don't embed Control-Z characters in them.
  • If you must embed Control-Z characters, and you want the files to be read portably, open the files in binary mode.

Note that here "portably" means "using any application, even if you only program on Windows!"
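The second rule can be sketched as follows: read the whole file in binary mode, so that Control-Z arrives as an ordinary byte, and then decide in your own code what to do with it. The names read_all_binary, count_control_z and kControlZ are assumptions made for this illustration, not part of any library.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

const char kControlZ = 26;   // ASCII code 26, hex 1A

// Read every byte of the file, untranslated, into a string.
// An unopenable file simply yields an empty string in this sketch.
std::string read_all_binary(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Count how many Control-Z bytes the data contains.
std::size_t count_control_z(const std::string& data) {
    std::size_t n = 0;
    for (char c : data) {
        if (c == kControlZ) {
            ++n;
        }
    }
    return n;
}
```

For a file holding "line 1\n", a Control-Z and "line 2\n" with no newline translation, read_all_binary should return all 15 bytes on either platform, and count_control_z should report one.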

But what about Control-D?

Some Linux users may at this point be thinking, "But what about the Control-D character I use to end shell input? Isn't that an end-of-file character?" Well, no, it isn't. If you embed a Control-D character in a text file, Linux will pay absolutely no attention to it. In fact, the Control-D you type at the shell to end input is simply a signal to the terminal driver to report end-of-file on the standard input stream. No character is inserted into the stream at all. In fact, using the stty utility, you can change the character that signals the end of standard input from Control-D to whatever you like, but in no case will a special character be inserted in the input stream, and even if it were, Linux would not treat it as an end-of-file marker.

The EOF value in C++ and C

Just to confuse things even more, both C++ and C define a special value with the name EOF. In C, it is typically defined in <stdio.h> as:

#define EOF (-1)

and similarly in <cstdio> for C++.

Notice that EOF in this context has nothing to do with Control-Z. It doesn't have the value 26 and in fact in use it is not a character at all but an integer. It is used as the return value of functions like this:

int getchar(void);

The getchar() function is used to read individual characters from standard input and returns the special value EOF when the end-of-file is reached. The end of file may or may not be indicated by the Control-Z character (see discussion above), but in no case will the EOF value be the same as the ASCII code for Control-Z. In fact, getchar() returns an int, not a char, and it's important that its return value is stored in an int, as a comparison between a char and a signed integer is not guaranteed to work correctly. The canonical way to use this function to read standard input is:

#include <stdio.h>

int main() {
    int c;
    while( (c = getchar()) != EOF ) {
        putchar( c );
    }
}
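To see why the return value belongs in an int, the following sketch narrows a value of the kind getchar() returns into each char type and compares it with EOF afterwards. It assumes the usual case where EOF is -1 and a char is 8 bits wide; the function names are made up for the example.

```cpp
#include <cstdio>

// Store the value in a signed char, as happens with plain char on many
// platforms, then compare. The byte 0xFF becomes -1 and looks like EOF.
bool signed_char_sees_eof(int value_from_getchar) {
    signed char c = static_cast<signed char>(value_from_getchar);
    int widened = c;              // the promotion a comparison would apply
    return widened == EOF;
}

// Store the value in an unsigned char, then compare. The result lies in
// 0..255, so it can never equal the negative EOF and a loop never ends.
bool unsigned_char_sees_eof(int value_from_getchar) {
    unsigned char c = static_cast<unsigned char>(value_from_getchar);
    int widened = c;
    return widened == EOF;
}

// Keep the value in an int: real bytes and EOF stay distinguishable.
bool int_sees_eof(int value_from_getchar) {
    return value_from_getchar == EOF;
}
```

Under those assumptions, signed_char_sees_eof(0xFF) reports a false end-of-file for a perfectly valid byte, unsigned_char_sees_eof(EOF) misses a genuine one, and only int_sees_eof gets both cases right.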

The eof() and feof() functions

Another layer of confusion is added by both C++ and C providing functions to check the state of an input stream. Almost all beginner programmers get confused by these functions, so it may be a good idea to state up-front what they do and how they should not be used:

Both eof() and feof() check the state of an input stream to see if an end-of-file condition has occurred. Such a condition can only occur following an attempted read operation. If you call either function without previously performing a read, your code is wrong! Never loop on an eof function.

To illustrate this, let's write a program that reads a file, and adds line numbers to the file contents on output. To simplify things, we'll use a fixed file name and skip any error checking. Most beginners will write something like this:

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main() {
    ifstream ifs( "afile.txt" );
    int n = 0;
    while( ! ifs.eof() ) {      // WRONG: eof() is tested before any read
        string line;
        getline( ifs, line );
        cout << ++n << " " << line << '\n';
    }
}

This seems sensible enough, but remember the advice: "If you call either function without previously performing a read, your code is wrong!" In this case we are indeed calling eof() before a read operation. To see why this is wrong, consider what happens if afile.txt is an empty file. The first time through the loop the check for eof() will return false, as no read operation has occurred. We then read something, which will set the end-of-file condition, but too late. And we then output a line, with line number 1, that does not exist in the input file. By similar logic, the program outputs one spurious extra line for any file that ends with a newline, as text files normally do.

To write the program properly, you need to call the eof() function after the read operation, or not at all. If you are not expecting to encounter problems other than end-of-file, you would write the code like this:

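#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main() {
    ifstream ifs( "afile.txt" );
    int n = 0;
    string line;
    while( getline( ifs, line ) ) {     // the read itself is the loop test
        cout << ++n << " " << line << '\n';
    }
}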

This uses a conversion operator which turns the return value of getline(), which is the stream passed as the first parameter, into something that can be tested in a while-loop; the loop continues as long as the stream is not in an end-of-file (or other error) condition.
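The difference between the two loop styles can be checked without any files by running both over in-memory streams. This is only a sketch: the two function names are invented for the comparison, and each returns the number of "lines" its loop would have printed.

```cpp
#include <istream>
#include <sstream>
#include <string>

// The beginner's loop: tests eof() before reading.
// (It would spin forever on a stream that fails without reaching
// end-of-file, which is one more reason not to write it.)
int lines_seen_by_eof_loop(std::istream& in) {
    int n = 0;
    while (!in.eof()) {
        std::string line;
        std::getline(in, line);   // result ignored, as in the buggy program
        ++n;
    }
    return n;
}

// The recommended loop: the read itself is the loop condition.
int lines_seen_by_getline_loop(std::istream& in) {
    int n = 0;
    std::string line;
    while (std::getline(in, line)) {
        ++n;
    }
    return n;
}
```

Given a std::istringstream holding no text at all, the first function returns 1 and the second returns 0. Given "a\nb\n", the first returns 3 and the second returns 2.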

Similarly in C. You should not write code like this:

#include <stdio.h>

int main() {
    FILE * f = fopen( "afile.txt", "r" );
    char line[100];
    int n = 0;
    while( ! feof( f ) ) {      /* WRONG: feof() is tested before any read */
        fgets( line, 100, f );
        printf( "%d %s", ++n, line );
    }
    fclose( f );
}

which will almost certainly print garbage if handed an empty file (and exhibit undefined behaviour too). You want:

#include <stdio.h>

int main() {
    FILE * f = fopen( "afile.txt", "r" );
    char line[100];
    int n = 0;
    while( fgets( line, 100, f ) != NULL ) {
        printf( "%d %s", ++n, line );
    }
    fclose( f );
}

So if eof() and feof() are so apparently useless, why do the C++ and C Standard Libraries supply them? Well, they are useful in the case where a read error could be caused by something other than end-of-file, and you want to distinguish if that's the case:

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main() {
    ifstream ifs( "afile.txt" );
    int n = 0;
    string line;
    while( getline( ifs, line ) ) {
        cout << ++n << " " << line << '\n';
    }
    if ( ifs.eof() ) {
        // OK - EOF is an expected condition
    }
    else {
        // ERROR - we hit something other than EOF
    }
}
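A case where the distinction matters more visibly is formatted input, where a read can fail simply because the text is not in the expected form. The sketch below reads whitespace-separated integers and uses eof() after the loop to report why it stopped. ReadResult and read_all_ints are hypothetical names, and corner cases such as a lone sign character at the very end of the input are not handled.

```cpp
#include <istream>
#include <sstream>
#include <vector>

enum class ReadResult { ReachedEof, StoppedOnError };

// Read integers until a read fails, then ask the stream why it failed.
ReadResult read_all_ints(std::istream& in, std::vector<int>& out) {
    int value;
    while (in >> value) {          // the read is the loop condition
        out.push_back(value);
    }
    // eof() is called only after a read has failed, so its answer is meaningful.
    return in.eof() ? ReadResult::ReachedEof : ReadResult::StoppedOnError;
}
```

With the input "1 2 3\n" the function collects three values and reports ReachedEof. With "1 2 x 3" it collects two values and reports StoppedOnError, because the failed read of x never reached the end of the stream.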

Summary

All the above may make it seem that the EOF issue is extremely complicated, but it really only comes down to three basic rules:

  • There is no EOF character, unless you open files in text mode on Windows, or implement one yourself.
  • The EOF symbol in C++ and C is not an end-of-file character; it is a special return value of certain library functions.
  • Don't loop on the eof() or feof() functions.

If you keep these rules in mind, you should avoid being bitten by most of the bugs associated with misunderstanding the nature of the end-of-file condition in C++ and C.

Source: https://latedev.wordpress.com/2012/12/04/all-about-eof/