Robert Coleridge
Microsoft Developer Network Technology Group
July 1996
This article discusses the Internet application programming interface (API) in general and then looks in detail at several of the API functions that would be foundational to anyone who is interested in writing Internet browser or crawler applications. By using these functions—InternetOpen, InternetOpenUrl, InternetReadFile, and InternetCloseHandle—you can easily put together a number of useful Internet-aware utilities and applications. This article also examines how to access the Internet, as well as access and download pages from the Internet.
The Win32® Internet functions ("WinInet" for short) are exported from the WININET.DLL. The functions are documented in the Web Workshop or the newest release of the Platform SDK.
To compile the NetGet sample application, you also need Microsoft® Visual C++® version 4.0 or later.
Programming for the Internet used to require knowledge of various protocols such as TCP/IP, an understanding of sockets, and so on. Fortunately for the majority of us, this is no longer the case. Developing an Internet-aware application can now be done easily and painlessly. This is done through a set of functions known as the WinInet API. These functions, by hiding most of the "techie" stuff for us, make programming for the Internet painless and productive. What used to take days now takes hours to create. Although there is still a certain level of knowledge required to use the API functions, it is now a significantly easier task to develop an Internet application.
The WinInet API currently comprises four interrelated groups of functions: general Internet URL functions, FTP functions, HTTP functions, and Gopher functions. A number of functions within each group overlap, but with such an impressive array of functions available to you, the world is at your fingertips—literally.
The Internet API provides general URL functions and protocol-specific functions, such as HTTP, FTP, and Gopher.
The first group of functions is the most general: these functions allow you to access FTP, HTTP, or Gopher simply by the associated URL. With this group of functions, access to Internet information is a simple three-step process. You first obtain a handle to a specified URL with one function, then use another function to read information with the handle, and lastly, close the handle. What could be easier? This all happens without your having to know much about what the functions are doing.
This group also provides functions for doing things such as combining URL components, breaking or cracking a URL into its components, and moving around a URL file (similar to setting a file pointer).
The next three groups of functions are grouped together based on their protocol. At present, there are groups of functions for HTTP, FTP, and Gopher. Each group deals with specifics of that protocol at a deeper level than does the more general, first group of functions.
Each of these protocol-specific groups of functions has overlapping functionality: for example, each group uses InternetConnect to make the initial connection. Yet each group has functions that are more suited to the usage of that protocol. For example, the FTP functions enable you to manipulate directories, which for HTTP pages would not be of much use. The specifics of each group are beyond the scope of this article. For a comprehensive listing of the API function set, see the Web Workshop or the newest release of the Platform SDK.
The following sections discuss some Internet API functions that are more general or generic in their usage. I have selected the ones most likely to be used in a general Internet-aware application. To use these functions you need an Internet agent, such as Microsoft Internet Explorer. This agent, or browser, performs the actual Internet accessing, verification, and so forth.
For a general utility or application, you should start with the InternetOpen function. By specifying the Internet agent you want to do the Internet access (for example, Microsoft Internet Explorer), the type of access you want, and a few optional flags, this function returns you a handle to an Internet session. When you are finished with the connection, you must close it by passing the handle to the InternetCloseHandle function. For example:
HINTERNET hInternetSession;
hInternetSession = InternetOpen(
"Microsoft Internet Explorer", // agent
INTERNET_OPEN_TYPE_PRECONFIG, // access
NULL, NULL, 0); // defaults
.
.
.
InternetCloseHandle(hInternetSession);
This will make a connection to the Internet, using the Internet agent "Microsoft Internet Explorer," and return a handle to the connection, if successful. By specifying the parameter INTERNET_OPEN_TYPE_PRECONFIG, you have requested the agent to use certain values that are stored in the registry. The rest of the parameters are set to use the default configurations. In one simple call, you have made a connection to the Internet—assuming nothing went wrong with the connection, of course! With the returned hInternetSession handle used as a parameter to the other functions, you can start accessing Internet information.
With an Internet connection established by the InternetOpen function, you can now access Internet information through the InternetOpenUrl function. This function allows you to access information on the Internet by specifying the URL you wish to access. This is done in a protocol-independent way, through HTTP, FTP, or Gopher; the function and agent sort all this out at run time. By passing in the handle obtained from a call to InternetOpen, along with the URL and a few optional parameters, this function (if successful) will return a handle to the information. You can then do what you want with that page or file (provided you have access, of course). When you are finished with the connection to the URL, you must close it by passing the handle you received from the call to InternetOpenUrl to the InternetCloseHandle function. For example, to access the hypothetical site http://www.acompany.com/welcome.htm you would simply do the following:
HINTERNET hURL;
HINTERNET hInternetSession;
.
.
.
hURL = InternetOpenUrl(
hInternetSession, // session handle
"http://www.acompany.com/welcome.htm", // URL to access
NULL, 0, 0, 0); // defaults
.
.
.
InternetCloseHandle(hURL);
It is as simple as that. Of course, this sample assumes that you obtained the hInternetSession handle before you made the call to InternetOpenUrl.
The InternetReadFile function is the one you would use to actually download Internet information into memory. You do this by simply passing in the handle to a URL (obtained from a previous call to InternetOpenUrl), a pointer to a buffer to receive the data, and the size of the buffer. Let me show you how to read 1024 bytes of a page into memory:
BOOL bResult;
char cBuffer[1024]; // I'm only going to access 1K of info.
DWORD dwBytesRead;
HINTERNET hURL;
.
.
.
bResult = InternetReadFile(
hURL, // handle to URL
(LPSTR)cBuffer, // pointer to buffer
(DWORD)1024, // size of buffer
&dwBytesRead); // pointer to var to hold return value
This example reads the information pointed to by the hURL handle, stores it in the cBuffer character buffer, and sets the variable dwBytesRead to the number of bytes stored in the buffer. That is all it takes to read some information off the Internet. (This example assumes that you will receive all of the information in one call. If not, you simply repeat the call until all information is received.)
If the return value is TRUE and the number of bytes read is zero, the transfer has been completed and there are no more bytes to read on the handle. This is the same as reaching EOF in a local file. The InternetCloseHandle function should always be called when the work with this handle is done.
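The "repeat the call until all information is received" step can be sketched as a loop. The sketch below is not taken from the NetGet sample, and to stay runnable without a network it does not call WinInet at all. Instead, read_fn is a hypothetical callback type that mimics the InternetReadFile convention described above (nonzero for success, with zero bytes read meaning the transfer is complete). The names read_all, mem_src, and mem_read are likewise invented for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for InternetReadFile: returns nonzero on success and
   stores the byte count in *got; success with *got == 0 means end of data. */
typedef int (*read_fn)(void *src, char *buf, unsigned long cap,
                       unsigned long *got);

/* Calls `reader` repeatedly until it reports zero bytes, growing a heap
   buffer as it goes. Returns a NUL-terminated buffer (caller frees) and its
   length in *total, or NULL on a read or allocation failure. */
char *read_all(read_fn reader, void *src, unsigned long *total)
{
    char chunk[1024];
    char *data = NULL;
    char *bigger;
    unsigned long used = 0, got = 0;

    assert(reader != NULL && total != NULL);
    *total = 0;
    for (;;) {
        if (!reader(src, chunk, sizeof chunk, &got)) {
            free(data);
            return NULL;                 /* the read itself failed */
        }
        if (got == 0)
            break;                       /* success + 0 bytes: like EOF */
        bigger = realloc(data, used + got + 1);
        if (bigger == NULL) {
            free(data);
            return NULL;
        }
        data = bigger;
        memcpy(data + used, chunk, got);
        used += got;
    }
    if (data == NULL) {                  /* empty resource: return "" */
        data = malloc(1);
        if (data == NULL)
            return NULL;
    }
    data[used] = '\0';
    *total = used;
    return data;
}

/* In-memory source, used only to exercise the loop without a network. */
struct mem_src { const char *p; unsigned long left; };

int mem_read(void *src, char *buf, unsigned long cap, unsigned long *got)
{
    struct mem_src *m = src;
    unsigned long n = m->left < cap ? m->left : cap;
    if (n > 7)
        n = 7;                           /* deliver short reads on purpose */
    memcpy(buf, m->p, n);
    m->p += n;
    m->left -= n;
    *got = n;
    return 1;
}
```

In a real application, the callback would presumably be a thin wrapper that passes the hURL handle, the buffer, and its size to InternetReadFile and returns its BOOL result.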
The InternetCloseHandle function is the Internet equivalent of the Win32 CloseHandle API function. It is used to shut down the connections specified, be they from InternetOpenUrl or InternetOpen. It is quite simple to use:
HINTERNET hURL;
.
.
.
InternetCloseHandle(hURL);
It is imperative to remember that for Internet work, just as for Win32, you should always close connections when you are finished with them.
Let's say you wanted to do something really simple, like read a file from a given URL. A code segment might look like the following:
HINTERNET hInternetSession;
HINTERNET hURL;
char cBuffer[1024]; // I'm only going to access 1K of info.
BOOL bResult;
DWORD dwBytesRead;
// Make internet connection.
hInternetSession = InternetOpen(
"Microsoft Internet Explorer", // agent
INTERNET_OPEN_TYPE_PRECONFIG, // access
NULL, NULL, 0); // defaults
// Make connection to desired page.
hURL = InternetOpenUrl(
hInternetSession, // session handle
"http://www.acompany.com/welcome.htm", // URL to access
NULL, 0, 0, 0); // defaults
// Read page into memory buffer.
bResult = InternetReadFile(
hURL, // handle to URL
(LPSTR)cBuffer, // pointer to buffer
(DWORD)1024, // size of buffer
&dwBytesRead); // pointer to var to hold return value
// Close down connections.
InternetCloseHandle(hURL);
InternetCloseHandle(hInternetSession);
That is all it takes to connect to, access some information from, and disconnect from a specific URL on the Internet. As I said in the beginning of this article, it is a very simple task to do.
By using just a few Internet APIs I will show you how to write a console application that will "reach out and touch someone." This application, which I have called NetGet, will allow you to download information from the Internet (in page, file, or other format) and store it on your local machine. By using the Internet APIs discussed in this article plus one or two others, you will be able to get information from a single URL or multiple URLs, parse the HTML tags on these pages, and extract the files and links.
In effect, you will be writing your own Internet information extractor. I give you fair warning, though: downloading a large file, or a page that has intensive graphics or links, can result in filling your hard disk. Don't say I didn't warn you! Let's examine the basic process required to first download information onto your hard drive.
This procedure is the simplest to do. You first make a connection to the Internet via InternetOpen, and then make a connection to the desired URL with InternetOpenUrl. With that handle, download the information into memory with InternetReadFile. Once that information is in memory, you simply write the data to your local hard drive.
The following example assumes that you are only downloading a small file (1024 bytes). For the sake of brevity, some parameter lists are not filled in.
BOOL GetURLPageAndStoreToDisk(LPSTR pURLPage, LPSTR pOutputFile)
{
HINTERNET hSession;
HINTERNET hURL;
char cBuffer[1024]; // Assume small page for sample.
BOOL bResult;
DWORD dwBytesRead;
HANDLE hOutputFile;
// Make internet connection.
hSession = InternetOpen("Microsoft Internet Explorer", . . .
// make connection to desired page
hURL = InternetOpenUrl(hSession, pURLPage, . . .
// read page into memory buffer
bResult = InternetReadFile(hURL, (LPSTR)cBuffer,
(DWORD)1024, &dwBytesRead);
// close down connections
InternetCloseHandle(hURL);
InternetCloseHandle(hSession);
// create output file
hOutputFile = CreateFile(pOutputFile, . . .
// write out data
bResult = WriteFile(hOutputFile, cBuffer, . . .
// close down file
CloseHandle(hOutputFile);
// return success
return(TRUE);
}
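One detail worth spelling out in the write step is that only the bytes actually received (dwBytesRead in the example above) should go to disk, not the full size of the buffer. The following is a minimal sketch of that step using standard C file I/O as a portable stand-in for CreateFile and WriteFile. The function name save_buffer_to_file is hypothetical and is not part of the NetGet sample.

```c
#include <assert.h>
#include <stdio.h>

/* Writes exactly `len` bytes of `data` to `path` in binary mode, so that
   pages and images are stored unmodified.
   Returns 1 on success, 0 on any failure. */
int save_buffer_to_file(const char *path, const char *data, size_t len)
{
    FILE *f;
    size_t written;

    assert(path != NULL && (data != NULL || len == 0));
    f = fopen(path, "wb");
    if (f == NULL)
        return 0;
    written = fwrite(data, 1, len, f);   /* len = bytes read, not buffer size */
    if (fclose(f) != 0)
        return 0;
    return written == len;
}
```

A caller would pass the receive buffer together with the byte count reported by the read call, and would check the return value rather than assume success.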
Now that you have looked at how to download a single file from the Internet, all you need to do to download multiple files is repetitively call the single-page procedure and pass in different URL references. For example:
GetURLPageAndStoreToDisk(
"HTTP://WWW.SOMESITE.COM/INTERESTING.HTM",
"C:\\PAGES\\INTEREST.HTM");
GetURLPageAndStoreToDisk(
"HTTP://WWW.OTHERSITE.COM/COOL.GIF",
"C:\\PAGES\\COOL.GIF");
etc.
Note that the use of the double backslash ("\\") in the code above has to do with C/C++ language syntax. It may need to be only single (\) in another language, such as Visual Basic. For details, see your language reference.
Parsing the HTML data is the most challenging aspect of this type of application. HTML is a tag-based language. What do I mean by "tag-based language"? Let's examine a line of HTML.
<a href="menu.htm"><img src="img/blue-icon.gif" border="0"></a>
There are two "tag" sets on this line. There is the "<a href . . .> . . . </a>" tag set and the "<img . . .>" tag. Both of these tags contain URLs. The "<a" tag contains "menu.htm", and the "<img" tag contains "img/blue-icon.gif". These URLs are relative to the page they are contained on. Either one could just as easily have read something like "http://www.home.com/welcome.htm", which is an absolute URL.
The difficulty in parsing the tags presents us with two different challenges. The first challenge, as you have seen, is determining if a URL is relative or absolute, and adjusting it accordingly. The second challenge comes from trying to determine which tags you need to examine and which ones are meaningless for your application. For the NetGet application, I have supplied a simple data file containing a list of relevant tags. This file can be modified at any time, thus making the application more usable.
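The first challenge, telling relative URLs from absolute ones and adjusting them, can be sketched in portable C as follows. This is an illustration rather than the NetGet code: is_absolute_url and resolve_url are hypothetical names, and the sketch assumes the base URL has no query string and that references do not use "." or ".." segments. On Windows, the WinInet function InternetCombineUrl is the more complete way to do this.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Returns 1 if `url` starts with a scheme such as "http://", else 0. */
int is_absolute_url(const char *url)
{
    const char *sep = strstr(url, "://");
    const char *p;

    if (sep == NULL || sep == url)
        return 0;
    for (p = url; p < sep; p++)          /* scheme: letters, digits, + - . */
        if (!isalnum((unsigned char)*p) && *p != '+' && *p != '-' && *p != '.')
            return 0;
    return 1;
}

/* Combines the absolute URL of the containing page (`base`) with a reference
   found on it (`ref`) and stores the result in `out`.
   Returns 1 on success, 0 if `base` is not absolute or `out` is too small. */
int resolve_url(const char *base, const char *ref, char *out, size_t cap)
{
    const char *host, *path, *last_slash;
    size_t keep, need_slash;

    assert(base != NULL && ref != NULL && out != NULL);
    if (is_absolute_url(ref)) {          /* already absolute: copy as-is */
        if (strlen(ref) >= cap)
            return 0;
        strcpy(out, ref);
        return 1;
    }
    if (!is_absolute_url(base))
        return 0;
    host = strstr(base, "://") + 3;      /* first character of the host */
    path = strchr(host, '/');            /* start of the path, or NULL */
    if (ref[0] == '/') {                 /* root-relative: keep scheme + host */
        keep = path ? (size_t)(path - base) : strlen(base);
    } else {                             /* page-relative: keep the directory */
        last_slash = path ? strrchr(path, '/') : NULL;
        keep = last_slash ? (size_t)(last_slash - base) + 1 : strlen(base);
    }
    need_slash = (path == NULL && ref[0] != '/') ? 1 : 0;
    if (keep + need_slash + strlen(ref) >= cap)
        return 0;
    memcpy(out, base, keep);
    if (need_slash)
        out[keep++] = '/';
    strcpy(out + keep, ref);
    return 1;
}
```

For a page at http://www.home.com/docs/welcome.htm, the reference "menu.htm" resolves into the same directory as the page, while "/img/blue-icon.gif" resolves against the root of the site.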
Please note that not all pages can be downloaded. For example, an .ASP file, which is a Microsoft Internet Server library module, often sits in an "execute-only" virtual directory somewhere. For this reason I have coded the sample application to read another data file, containing acceptable extensions, and to download only URLs with these extensions. This file can be modified and extended by adding whatever extensions you would like.
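The extension check might look something like the sketch below. It is not the NetGet implementation: the function names are hypothetical, the list of extensions is passed in as an array rather than read from a data file, and URLs with query strings or with no file name after the host are not handled specially.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Case-insensitive string equality, written out by hand because stricmp
   and strcasecmp are not part of standard C. */
static int equal_ignore_case(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;                     /* both must end together */
}

/* Returns 1 if the last path component of `url` ends in one of the
   extensions in `allowed` (each written with its dot, e.g. ".htm"). */
int has_allowed_extension(const char *url, const char *const allowed[],
                          size_t count)
{
    const char *slash, *name, *dot;
    size_t i;

    assert(url != NULL);
    slash = strrchr(url, '/');
    name = slash ? slash + 1 : url;      /* text after the last '/' */
    dot = strrchr(name, '.');
    if (dot == NULL)
        return 0;                        /* no extension at all */
    for (i = 0; i < count; i++)
        if (equal_ignore_case(dot, allowed[i]))
            return 1;
    return 0;
}
```

With an allowed list of ".htm", ".html", and ".gif", a URL ending in run.asp is rejected, while COOL.GIF is accepted regardless of case.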
Using only the above-mentioned APIs (and one or two other simple API functions) I have put together a sample application that demonstrates how to do the things discussed in this article. With this application you can:
In order to do these things I have provided functions to:
The first group of key functions this article examines are those used in conjunction with the HTML data. They are LookForTagAndExtractValue and ExtractTagValue. These functions, respectively, scan a line of data for a specified HTML tag, and extract the URL reference from that tag.
The first function, LookForTagAndExtractValue, operates by scanning the line of data for any tag found in the tag table. Once it has found what appears to be a match, it calls the ExtractTagValue function, which will extract the value from the tag. If a value is found, LookForTagAndExtractValue will null-terminate the returned string at any end-of-tag delimiter. If a tag and value are found, a result of TRUE is returned to the caller. If not found, FALSE is returned.
The second function, ExtractTagValue, simply scans through the data passed to it, looking for a specified keyword. Once it has found what appears to be a match, it looks on either side of the found tag to see if it is stand-alone (that is, does not have alphanumeric characters on either side, but only spaces or punctuation). If it has found a valid match, a pointer to the data is returned; if a match is not found, a NULL result is returned.
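The stand-alone test described for ExtractTagValue can be illustrated with a small sketch. This is my own approximation of the idea, not the code in the sample: find_standalone_keyword and matches_at are hypothetical names, and the match is made case-insensitive on the assumption that HTML tag names may appear in either case.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Returns 1 if `keyword` occurs at `p`, ignoring case. Stops at the end of
   the data, so it never reads past the terminating NUL. */
static int matches_at(const char *p, const char *keyword)
{
    while (*keyword) {
        if (tolower((unsigned char)*p) != tolower((unsigned char)*keyword))
            return 0;
        p++;
        keyword++;
    }
    return 1;
}

/* Returns a pointer to the first occurrence of `keyword` in `data` that has
   no alphanumeric character directly before or after it, or NULL if there
   is no such occurrence. */
const char *find_standalone_keyword(const char *data, const char *keyword)
{
    size_t len;
    const char *p;

    assert(data != NULL && keyword != NULL);
    len = strlen(keyword);
    if (len == 0)
        return NULL;
    for (p = data; *p; p++) {
        if (!matches_at(p, keyword))
            continue;
        if (p > data && isalnum((unsigned char)p[-1]))
            continue;                    /* part of a longer word (left) */
        if (isalnum((unsigned char)p[len]))
            continue;                    /* part of a longer word (right) */
        return p;
    }
    return NULL;
}
```

Applied to the sample HTML line shown earlier, a search for "href" succeeds because the keyword is bounded by a space and an equal sign, while a search for "men" fails because it is only the start of "menu".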
The sample code examined in the "Putting It All Together" section of this article is the function to actually retrieve Internet information via a URL. The function in the NetGet sample, called GetURLPage, does more than the "Putting It All Together" function. It encompasses the following:
The class was written to store a handle to the Internet connection, returned from a previous call to StartInternetSession. The function takes three parameters that require some explanation.
Once an HTML page has been downloaded, it must be parsed, if requested. This is done by the ParseURLPage function. This function encapsulates four functions, similar to the directory parsing functions in the Win32 API. The ParseOpen and ParseClose function pair simply initializes and cleans up the parsing structures and data. The ParseFirstURL and ParseNextURL pair are written to be used in a looping paradigm—that is, you call ParseFirstURL to retrieve the first URL in the HTML file, then process the URL reference, and continue calling ParseNextURL and processing until there are no more URLs to process.
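The first/next looping paradigm can be sketched as follows. The names echo the functions described above, but everything here is hypothetical: the real ParseURLPage family reads its tags from a data file, whereas this sketch only looks for lowercase, double-quoted href= and src= values. Because nothing is allocated, no close function is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical parsing state: the page text and the current scan position. */
struct url_parser { const char *page; const char *pos; };

void parse_open(struct url_parser *ps, const char *page)
{
    assert(ps != NULL && page != NULL);
    ps->page = page;
    ps->pos = page;
}

/* Copies the next quoted href= or src= value into `out`.
   Returns 1 if a URL was found, 0 when the page is exhausted. */
int parse_next_url(struct url_parser *ps, char *out, size_t cap)
{
    assert(ps != NULL && out != NULL);
    while (*ps->pos) {
        const char *href = strstr(ps->pos, "href=\"");
        const char *src = strstr(ps->pos, "src=\"");
        const char *start, *end;
        size_t len;

        if (href == NULL && src == NULL)
            break;
        if (href != NULL && (src == NULL || href < src))
            start = href + 6;            /* skip past href=" */
        else
            start = src + 5;             /* skip past src=" */
        end = strchr(start, '"');
        if (end == NULL)
            break;                       /* unterminated value: give up */
        ps->pos = end + 1;
        len = (size_t)(end - start);
        if (len == 0 || len >= cap)
            continue;                    /* skip empty or oversized values */
        memcpy(out, start, len);
        out[len] = '\0';
        return 1;
    }
    ps->pos += strlen(ps->pos);          /* park at the end of the page */
    return 0;
}

/* Rewinds to the top of the page and returns the first URL, if any. */
int parse_first_url(struct url_parser *ps, char *out, size_t cap)
{
    assert(ps != NULL);
    ps->pos = ps->page;
    return parse_next_url(ps, out, cap);
}

/* Example of the looping paradigm: counts the URLs on a page. */
int count_urls(const char *page)
{
    struct url_parser ps;
    char url[256];
    int n = 0, ok;

    parse_open(&ps, page);
    for (ok = parse_first_url(&ps, url, sizeof url); ok;
         ok = parse_next_url(&ps, url, sizeof url))
        n++;
    return n;
}
```

The loop in count_urls has the same shape as a directory scan written with a find-first and find-next pair: one call to start, then repeated calls until the function reports that nothing is left.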
There are other functions in the sample that are relevant and self-explanatory, for example, Set_Logging, which simply sets a Boolean, and MyInternetSplitPath, which implements a "C run time" splitpath type of routine for URLs. These functions are very well commented and left up to you to enjoy and explore.
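For readers who want a feel for what a split-path routine for URLs involves before opening the sample, here is one possible shape. It is not MyInternetSplitPath: split_url is a hypothetical function that only separates scheme, host, and path, and it assumes a URL of the form scheme://host/path with no port, user name, or query handling. WinInet's InternetCrackUrl is the full-featured equivalent on Windows.

```c
#include <assert.h>
#include <string.h>

/* Splits "scheme://host/path" into its three parts. Each output buffer must
   hold at least `cap` bytes. A missing path is reported as "/".
   Returns 1 on success, 0 if the URL is malformed or a part does not fit. */
int split_url(const char *url, char *scheme, char *host, char *path,
              size_t cap)
{
    const char *sep, *h, *p;
    size_t slen, hlen, plen;

    assert(url != NULL && scheme != NULL && host != NULL && path != NULL);
    sep = strstr(url, "://");
    if (sep == NULL || sep == url)
        return 0;                        /* no scheme: not an absolute URL */
    h = sep + 3;                         /* host starts after "://" */
    p = strchr(h, '/');                  /* path starts at the next '/' */
    slen = (size_t)(sep - url);
    hlen = p ? (size_t)(p - h) : strlen(h);
    plen = p ? strlen(p) : 1;
    if (hlen == 0 || slen >= cap || hlen >= cap || plen >= cap)
        return 0;
    memcpy(scheme, url, slen);
    scheme[slen] = '\0';
    memcpy(host, h, hlen);
    host[hlen] = '\0';
    strcpy(path, p ? p : "/");
    return 1;
}
```

Given the hypothetical URL used earlier in this article, the three parts come back as "http", "www.acompany.com", and "/welcome.htm".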
Having read this article, you now have the information necessary to write your own simple Internet Browser utility. With this information, the vast resources available to developers and programmers on the Internet are now at your fingertips. How far you go with this information and what type of utilities you write is limited only by your imagination. So dream big and have fun exploring the Internet.