Handling Uniform Resource Locators
A Uniform Resource Locator (URL) is a compact representation of the location and access method for a resource located on the Internet. Each URL consists of a scheme (HTTP, HTTPS, FTP, or Gopher) and a scheme-specific string. This string can also include a combination of a directory path, search string, or name of the resource. The Win32 Internet functions provide the ability to create, combine, break down, and canonicalize URLs. For more information on URLs, see RFC 1738, Uniform Resource Locators (URL). This document can be found at http://ds.internic.net/rfc/rfc1738.txt.
The URL functions operate in a task-oriented manner. They do not verify the content or format of the URL that is passed to them. The calling application should track the use of these functions to ensure that the data is in the intended format. For example, when called with no flags, the InternetCanonicalizeUrl function converts the character "%" into the escape sequence "%25". If InternetCanonicalizeUrl is then called again on the canonicalized URL, the escape sequence "%25" is converted into "%2525", which no longer decodes to the original character.
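The double-encoding problem can be reproduced without calling the API at all. The following is a minimal sketch, not the InternetCanonicalizeUrl implementation: EncodeOnce is a hypothetical stand-in that escapes only "%" and blank spaces and, like the real function, does not check whether its input has already been encoded.

```cpp
#include <string>

// Hypothetical stand-in for an "always encode" pass.
// It escapes only '%' and ' ' and never inspects whether the
// input already contains escape sequences.
std::string EncodeOnce(const std::string& url)
{
    std::string out;
    for (char c : url)
    {
        if (c == '%')
            out += "%25";
        else if (c == ' ')
            out += "%20";
        else
            out += c;
    }
    return out;
}
```

Calling EncodeOnce on its own output escapes the "%" of every existing escape sequence a second time, which is why an application should record whether a given string has already been canonicalized.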
The following table summarizes the URL functions in the Win32 Internet API.
Function                 Description
InternetCanonicalizeUrl  Canonicalizes the URL that is passed to the function. By default, InternetCanonicalizeUrl always encodes.
InternetCombineUrl       Combines base and relative URLs into one complete URL.
InternetCrackUrl         Breaks the URL into its component parts. This information is returned in a URL_COMPONENTS structure.
InternetCreateUrl        Allows an application to create a complete URL from its component parts.
InternetOpenUrl          Locates the resource designated by a canonicalized URL and creates a handle, which can be used by InternetReadFile to retrieve that resource. InternetOpenUrl combines the tasks normally performed by InternetConnect with the operations handled by FtpOpenFile, GopherOpenFile, and HttpOpenRequest.
What Is a Canonicalized URL?
The format of all URLs must follow the accepted syntax and semantics in order to access resources via the Internet. Canonicalization is the process of formatting a URL to follow this accepted syntax and semantics.
Characters that must be encoded include any characters that have no corresponding graphic character in the US-ASCII coded character set (hexadecimal 80-FF, which are not used in the US-ASCII coded character set, and hexadecimal 00-1F and 7F, which are control characters), blank spaces, "%" (which is used to encode other characters), and unsafe characters (<, >, ", #, {, }, |, \, ^, ~, [, ], and ').
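The character classes listed above can be written as a single predicate. This is a sketch that transcribes the list from the preceding paragraph; MustEncode is a hypothetical helper name and is not part of the Win32 Internet API.

```cpp
#include <cstring>

// Returns true if the character falls into one of the classes that the
// text says must be encoded. The unsafe-character list is copied from
// the paragraph above.
bool MustEncode(unsigned char c)
{
    if (c >= 0x80)
        return true;                       // no US-ASCII graphic character
    if (c <= 0x1F || c == 0x7F)
        return true;                       // control characters
    if (c == ' ' || c == '%')
        return true;                       // blank space and the escape marker
    return std::strchr("<>\"#{}|\\^~[]'", c) != nullptr;   // unsafe characters
}
```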
Using the Win32 Internet Functions to Handle URLs
The following table summarizes the URL functions included with the Win32 Internet functions.
Function                 Description
InternetCanonicalizeUrl  Canonicalizes the URL.
InternetCombineUrl       Combines base and relative URLs.
InternetCrackUrl         Parses a URL string into components.
InternetCreateUrl        Creates a URL string from components.
InternetOpenUrl          Begins retrieving an FTP, Gopher, HTTP, or HTTPS resource.
Canonicalizing URLs
Canonicalizing a URL is the process that converts a URL (that might contain unsafe characters such as blank spaces, reserved characters, and so on) into an accepted format.
The InternetCanonicalizeUrl function can be used to canonicalize URLs. This function is very task-oriented, so the application should track its use carefully. InternetCanonicalizeUrl does not verify that the URL passed to it is already canonicalized, nor that the URL it returns is valid.
The following five flags control how InternetCanonicalizeUrl handles a particular URL. The flags can be used in combination. If no flags are used, the function encodes the URL by default.
Value                   Meaning
ICU_BROWSER_MODE        Do not encode or decode characters after "#" or "?", and do not remove trailing white space after "?". If this value is not specified, the entire URL is encoded, and trailing white space is removed.
ICU_DECODE              Convert all %XX sequences to characters, including escape sequences, before the URL is parsed.
ICU_ENCODE_SPACES_ONLY  Encode spaces only.
ICU_NO_ENCODE           Do not convert unsafe characters to escape sequences.
ICU_NO_META             Do not remove meta sequences (such as "." and "..") from the URL.
The ICU_DECODE flag should be used only on canonicalized URLs, because it assumes that all %XX sequences are escape codes and converts them into the characters indicated by the code. If the URL has a "%" symbol in it that is not part of an escape code, ICU_DECODE still treats it as one. This characteristic might cause InternetCanonicalizeUrl to create an invalid URL.
To use InternetCanonicalizeUrl to return a completely decoded URL, the ICU_DECODE and ICU_NO_ENCODE flags must be specified. This setup assumes that the URL being passed to InternetCanonicalizeUrl has been previously canonicalized.
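The reason decoding is only safe on a previously canonicalized URL can be seen in a small decoder. This is an illustrative sketch, not the code behind ICU_DECODE; DecodeEscapes and HexValue are hypothetical helpers. Like the flag, the sketch treats every "%" followed by two hexadecimal digits as an escape sequence, whether or not the author of the URL intended one.

```cpp
#include <string>

// Returns the numeric value of a hexadecimal digit, or -1 if c is not one.
static int HexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Converts every %XX sequence into the character it names.
std::string DecodeEscapes(const std::string& url)
{
    std::string out;
    for (std::size_t i = 0; i < url.size(); ++i)
    {
        if (url[i] == '%' && i + 2 < url.size())
        {
            int hi = HexValue(url[i + 1]);
            int lo = HexValue(url[i + 2]);
            if (hi >= 0 && lo >= 0)
            {
                out += static_cast<char>(hi * 16 + lo);
                i += 2;          // skip the two digits just consumed
                continue;
            }
        }
        out += url[i];           // ordinary character, copied unchanged
    }
    return out;
}
```

A canonicalized string such as "100%25" decodes back to "100%", but a string that contains a literal, unescaped "%", such as "save%50now", is silently changed because "%50" happens to look like an escape sequence.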
Combining base and relative URLs
A relative URL is a compact representation of the location of a resource relative to an absolute base URL. The base URL must be known to the parser and usually includes the scheme, network location, and parts of the URL path. An application can call InternetCombineUrl to combine the relative URL with its base URL. InternetCombineUrl will also canonicalize the resultant URL.
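The idea of resolving a relative URL against a base can be sketched in portable C++. This is a deliberately simplified illustration, not the algorithm InternetCombineUrl uses: CombineSketch and RemoveDotSegments are hypothetical helpers, the base is assumed to have the form scheme://host/path, the relative URL is assumed to be a plain path, and query strings, fragments, and ".." segments that climb above the root are not treated specially. The file name index.htm is made up for the example.

```cpp
#include <string>
#include <vector>

// Collapses "." and ".." segments in an absolute path such as "/a/b/../c".
std::string RemoveDotSegments(const std::string& path)
{
    std::vector<std::string> kept;
    std::string seg;
    bool endsAsDirectory = false;
    for (std::size_t i = 0; i <= path.size(); ++i)
    {
        if (i < path.size() && path[i] != '/') { seg += path[i]; continue; }
        // A segment ends at each '/' and at the end of the string.
        if (seg == "..") { if (!kept.empty()) kept.pop_back(); }
        else if (!seg.empty() && seg != ".") kept.push_back(seg);
        if (i == path.size())
            endsAsDirectory = seg.empty() || seg == "." || seg == "..";
        seg.clear();
    }
    std::string out;
    for (const std::string& s : kept) out += "/" + s;
    if (endsAsDirectory || out.empty()) out += "/";
    return out;
}

// Resolves a relative path against a base of the form scheme://host/path.
std::string CombineSketch(const std::string& base, const std::string& relative)
{
    if (relative.empty()) return base;
    std::size_t sep = base.find("://");
    if (sep == std::string::npos) return relative;     // base is not usable
    std::size_t pathStart = base.find('/', sep + 3);
    std::string prefix = base.substr(0, pathStart);    // scheme and host
    std::string basePath =
        (pathStart == std::string::npos) ? "/" : base.substr(pathStart);
    std::string merged;
    if (relative[0] == '/')
        merged = relative;                              // replaces the whole path
    else
        merged = basePath.substr(0, basePath.rfind('/') + 1) + relative;
    return prefix + RemoveDotSegments(merged);
}
```

With the base "http://www.microsoft.com/windows/feature/default.htm" and the relative URL "../index.htm", the sketch produces "http://www.microsoft.com/windows/index.htm".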
Cracking URLs
The InternetCrackUrl function separates a URL into its component parts and returns the components indicated by the URL_COMPONENTS structure that is passed to the function.
The components that make up the URL_COMPONENTS structure are the scheme number, host name, port number, user name, password, URL path, and additional information (such as search parameters). Each component, except the scheme and port numbers, has a string member that holds the information and a member that holds the length of the string member. The scheme and port numbers have only a member that stores the corresponding value; they are both returned on all successful calls to InternetCrackUrl.
To get the value of a particular component in the URL_COMPONENTS structure, the member that stores the string length of that component must be set to a nonzero value. The string member can be either the address of a buffer or NULL.
If the pointer member contains the address of a buffer, the string length member must contain the size of that buffer. InternetCrackUrl returns the component information as a string in the buffer and stores the string length in the string length member.
If the pointer member is set to NULL, the string length member can be set to any nonzero value. InternetCrackUrl stores the address of the first character of the URL string that contains the component information and sets the string length to the number of characters in the remaining part of the URL string that pertains to the component.
All pointer members set to NULL with a nonzero length member point to the appropriate starting point in the URL string. The length stored in the length member must be used to determine the end of the individual component's information.
To finish initializing the URL_COMPONENTS structure properly, the dwStructSize member must be set to the size of the URL_COMPONENTS structure.
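The NULL-pointer mode described above returns pointers into the original URL string rather than copies, so a component is not zero-terminated and its length must be used to find its end. The following portable sketch imitates that contract for three components only. Component, CrackedSketch, CrackSketch, and ToString are hypothetical names, and the parsing is far simpler than InternetCrackUrl (no user name, password, port, or extra information).

```cpp
#include <cstring>
#include <string>

struct Component
{
    const char* start;    // points into the caller's URL string
    std::size_t length;   // number of characters that belong to the component
};

struct CrackedSketch
{
    Component scheme;
    Component host;
    Component path;
};

// Locates scheme, host, and path inside url without copying anything.
bool CrackSketch(const char* url, CrackedSketch* out)
{
    const char* sep = std::strstr(url, "://");
    if (sep == nullptr || sep == url)
        return false;
    const char* hostStart = sep + 3;
    const char* pathStart = std::strchr(hostStart, '/');
    if (pathStart == nullptr)
        pathStart = hostStart + std::strlen(hostStart);   // empty path
    out->scheme = { url, static_cast<std::size_t>(sep - url) };
    out->host   = { hostStart, static_cast<std::size_t>(pathStart - hostStart) };
    out->path   = { pathStart, std::strlen(pathStart) };
    return true;
}

// Copies exactly length characters; the component has no terminator of its own.
std::string ToString(const Component& c)
{
    return std::string(c.start, c.length);
}
```

Printing host.start directly with "%s" would run on to the end of the URL, which is why a caller either copies length characters, as ToString does, or temporarily writes a null character after the component.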
The following example cracks the URL entered in the edit box, IDC_PreOpen1, and writes the components to the list box, IDC_PreOpenList. To display only the information for an individual component, the function saves the character immediately after the component's information in the string and temporarily replaces it with a null character.
int WINAPI Cracker(HWND hX)
{
    URL_COMPONENTS urlcmpTheUrl;
    int intTestSize = 80;
    LPSTR lpszUrlIn;
    LPURL_COMPONENTS lpUrlComp = &urlcmpTheUrl;
    char TempOut[256];
    char tempChar;

    lpszUrlIn = new char[intTestSize];
    GetDlgItemText(hX, IDC_PreOpen1, lpszUrlIn, intTestSize);
    SendDlgItemMessage(hX, IDC_PreOpenList, LB_RESETCONTENT, 0, 0);

    urlcmpTheUrl.dwStructSize = sizeof(urlcmpTheUrl);

    /* NULL pointers ask InternetCrackUrl to return pointers into lpszUrlIn */
    urlcmpTheUrl.lpszScheme = NULL;
    urlcmpTheUrl.lpszHostName = NULL;
    urlcmpTheUrl.lpszUserName = NULL;
    urlcmpTheUrl.lpszPassword = NULL;
    urlcmpTheUrl.lpszUrlPath = NULL;
    urlcmpTheUrl.lpszExtraInfo = NULL;

    /* the following lines set which components will be displayed */
    urlcmpTheUrl.dwSchemeLength = 1;
    urlcmpTheUrl.dwHostNameLength = 1;
    urlcmpTheUrl.dwUserNameLength = 1;
    urlcmpTheUrl.dwPasswordLength = 1;
    urlcmpTheUrl.dwUrlPathLength = 1;
    urlcmpTheUrl.dwExtraInfoLength = 1;

    if (!InternetCrackUrl(lpszUrlIn, (DWORD)strlen(lpszUrlIn), 0, lpUrlComp))
    {
        ErrorOut(hX, GetLastError(), "Cracker");
        delete [] lpszUrlIn;
        return FALSE;
    }
    else
    {
        if (urlcmpTheUrl.dwSchemeLength != 0)
        {
            tempChar = urlcmpTheUrl.lpszScheme[urlcmpTheUrl.dwSchemeLength];
            urlcmpTheUrl.lpszScheme[urlcmpTheUrl.dwSchemeLength] = '\0';
            sprintf(TempOut, "Scheme: %s", urlcmpTheUrl.lpszScheme);
            SendDlgItemMessage(hX, IDC_PreOpenList, LB_ADDSTRING, 0, (LPARAM)TempOut);
            urlcmpTheUrl.lpszScheme[urlcmpTheUrl.dwSchemeLength] = tempChar;
        }

        sprintf(TempOut, "Scheme number: %d", urlcmpTheUrl.nScheme);
        SendDlgItemMessage(hX, IDC_PreOpenList, LB_ADDSTRING, 0, (LPARAM)TempOut);

        if (urlcmpTheUrl.dwHostNameLength != 0)
        {
            tempChar = urlcmpTheUrl.lpszHostName[urlcmpTheUrl.dwHostNameLength];
            urlcmpTheUrl.lpszHostName[urlcmpTheUrl.dwHostNameLength] = '\0';
            sprintf(TempOut, "Host Name: %s", urlcmpTheUrl.lpszHostName);
            SendDlgItemMessage(hX, IDC_PreOpenList, LB_ADDSTRING, 0, (LPARAM)TempOut);
            urlcmpTheUrl.lpszHostName[urlcmpTheUrl.dwHostNameLength] = tempChar;
        }

        sprintf(TempOut, "Port Number: %d", urlcmpTheUrl.nPort);
        SendDlgItemMessage(hX, IDC_PreOpenList, LB_ADDSTRING, 0, (LPARAM)TempOut);

        if (urlcmpTheUrl.dwUserNameLength != 0)
        {
            tempChar = urlcmpTheUrl.lpszUserName[urlcmpTheUrl.dwUserNameLength];
            urlcmpTheUrl.lpszUserName[urlcmpTheUrl.dwUserNameLength] = '\0';
            sprintf(TempOut, "User Name: %s", urlcmpTheUrl.lpszUserName);
            SendDlgItemMessage(hX, IDC_PreOpenList, LB_ADDSTRING, 0, (LPARAM)TempOut);
            urlcmpTheUrl.lpszUserName[urlcmpTheUrl.dwUserNameLength] = tempChar;
        }

        if (urlcmpTheUrl.dwPasswordLength != 0)
        {
            tempChar = urlcmpTheUrl.lpszPassword[urlcmpTheUrl.dwPasswordLength];
            urlcmpTheUrl.lpszPassword[urlcmpTheUrl.dwPasswordLength] = '\0';
            sprintf(TempOut, "Password: %s", urlcmpTheUrl.lpszPassword);
            SendDlgItemMessage(hX, IDC_PreOpenList, LB_ADDSTRING, 0, (LPARAM)TempOut);
            urlcmpTheUrl.lpszPassword[urlcmpTheUrl.dwPasswordLength] = tempChar;
        }

        if (urlcmpTheUrl.dwUrlPathLength != 0)
        {
            tempChar = urlcmpTheUrl.lpszUrlPath[urlcmpTheUrl.dwUrlPathLength];
            urlcmpTheUrl.lpszUrlPath[urlcmpTheUrl.dwUrlPathLength] = '\0';
            sprintf(TempOut, "Path: %s", urlcmpTheUrl.lpszUrlPath);
            SendDlgItemMessage(hX, IDC_PreOpenList, LB_ADDSTRING, 0, (LPARAM)TempOut);
            urlcmpTheUrl.lpszUrlPath[urlcmpTheUrl.dwUrlPathLength] = tempChar;
        }

        if (urlcmpTheUrl.dwExtraInfoLength != 0)
        {
            tempChar = urlcmpTheUrl.lpszExtraInfo[urlcmpTheUrl.dwExtraInfoLength];
            urlcmpTheUrl.lpszExtraInfo[urlcmpTheUrl.dwExtraInfoLength] = '\0';
            sprintf(TempOut, "Extra: %s", urlcmpTheUrl.lpszExtraInfo);
            SendDlgItemMessage(hX, IDC_PreOpenList, LB_ADDSTRING, 0, (LPARAM)TempOut);
            urlcmpTheUrl.lpszExtraInfo[urlcmpTheUrl.dwExtraInfoLength] = tempChar;
        }

        /* the component pointers refer to lpszUrlIn, so free it only now */
        delete [] lpszUrlIn;
        return TRUE;
    }
}
Creating URLs
The InternetCreateUrl function uses the information in the URL_COMPONENTS structure to create a Uniform Resource Locator.
The components that make up the URL_COMPONENTS structure are the scheme, host name, port number, user name, password, URL path, and additional information (such as search parameters). Each component, except the port number, has a string member that holds the information and a member that holds the length of the string member.
For each required component, the pointer member should contain the address of the buffer holding the information. The length member should be set to zero if the pointer member contains the address of a zero-terminated string; the length member should be set to the string length if the pointer member contains the address of a string that is not zero-terminated. The pointer member of any components that are not required must be set to NULL.
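The length convention described above (zero for a zero-terminated string, an explicit count otherwise, and NULL for a component that is not required) can be sketched as follows. Part, PartToString, and BuildUrlSketch are hypothetical names used only for illustration. The sketch handles just scheme, host, and path, and it does not reproduce the behavior of InternetCreateUrl.

```cpp
#include <string>

struct Part
{
    const char* text;     // NULL when the component is not required
    std::size_t length;   // 0 means "text is zero-terminated"
};

// Interprets a pointer/length pair according to the convention above.
std::string PartToString(const Part& p)
{
    if (p.text == nullptr)
        return std::string();
    return p.length == 0 ? std::string(p.text)
                         : std::string(p.text, p.length);
}

// Assembles scheme://host/path from the three parts.
std::string BuildUrlSketch(const Part& scheme, const Part& host, const Part& path)
{
    return PartToString(scheme) + "://" + PartToString(host) + PartToString(path);
}
```

The explicit length is what makes it possible to pass a host name that sits in the middle of a larger buffer, for example the first 17 characters of "www.microsoft.com/windows/", without copying it first.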
Accessing URLs directly
Gopher, FTP, and HTTP resources on the Internet can be accessed directly by using the InternetOpenUrl, InternetReadFile, and InternetFindNextFile functions. InternetOpenUrl opens a connection to the resource at the URL passed to the function. What happens next depends on the resource: if it is a file, InternetReadFile can download it; if it is a directory, InternetFindNextFile can enumerate the files within the directory (except when using CERN proxies). For more information on InternetReadFile, see Reading files. For more information on InternetFindNextFile, see Finding the next file.
For applications that need to operate through a CERN proxy, InternetOpenUrl can be used to access FTP directories and files. The FTP requests are packaged to appear like an HTTP request, which the CERN proxy would accept.
InternetOpenUrl uses the HINTERNET handle created by the InternetOpen function and the URL of the resource. The URL must include the scheme ("http:", "ftp:", "gopher:", "file:" [for a local file], or "https:" [for hypertext protocol secure]) and network location (such as "www.microsoft.com"). The URL can also include a path (for example, "/windows/feature/") and resource name (for example, "default.htm"). For HTTP or HTTPS requests, additional headers can be included.
InternetQueryDataAvailable, InternetFindNextFile, InternetReadFile, and InternetSetFilePointer (HTTP or HTTPS URLs only) can use the handle that is created by InternetOpenUrl to download the resource.
The following diagram illustrates what handles to use with each function.
The root HINTERNET handle created by InternetOpen is used by InternetOpenUrl. The HINTERNET handle created by InternetOpenUrl can be used by InternetQueryDataAvailable, InternetReadFile, InternetFindNextFile (not shown here), and InternetSetFilePointer (HTTP or HTTPS URLs only).
For more information about HINTERNET handles and the handle hierarchy, see Appendix A: HINTERNET Handles.
The following example connects to the resource by using the InternetOpenUrl function. The sample function then uses the InternetReadFile function to download the resource. The function displays the downloaded resource in the edit box indicated by intCtrlID.
int WINAPI UrlDump(HWND hX, int intCtrlID)
{
    HINTERNET hUrlDump;
    DWORD dwSize = 0;
    LPSTR lpszData;
    LPSTR lpszOutPut;
    LPSTR lpszHolding = NULL;
    int nCounter = 1;
    int nBufferSize;
    DWORD BigSize = 8000;

    // hRootHandle is the root HINTERNET handle created by InternetOpen.
    // "server.name" is a placeholder; a real URL must include the scheme.
    hUrlDump = InternetOpenUrl(hRootHandle, "server.name", NULL, 0,
                               INTERNET_FLAG_RAW_DATA, 0);
    if (hUrlDump == NULL)
    {
        ErrorOut(hX, GetLastError(), "InternetOpenUrl");
        return FALSE;
    }

    do
    {
        // Allocate the buffer
        lpszData = new char[BigSize + 1];

        // Read the data
        if (!InternetReadFile(hUrlDump, (LPVOID)lpszData, BigSize, &dwSize))
        {
            ErrorOut(hX, GetLastError(), "InternetReadFile");
            delete [] lpszData;
            delete [] lpszHolding;
            break;
        }
        else
        {
            // Add a null terminator to the end of the buffer
            lpszData[dwSize] = '\0';

            // Check if all of the data has been read. On the first pass
            // this is true only if the resource is empty.
            if (dwSize == 0)
            {
                // Write the final data to the textbox
                SetDlgItemText(hX, intCtrlID, lpszHolding ? lpszHolding : "");

                // Delete the existing buffers.
                delete [] lpszData;
                delete [] lpszHolding;
                break;
            }

            // Determine the buffer size to hold the new data and the data
            // already written to the textbox (if any).
            nBufferSize = (nCounter * BigSize) + 1;

            // Increment the number of buffers read
            nCounter++;

            // Allocate the output buffer
            lpszOutPut = new char[nBufferSize];

            // Make sure the buffer is not the initial buffer
            if (nBufferSize != int(BigSize + 1))
            {
                // Copy the data in the holding buffer
                strcpy(lpszOutPut, lpszHolding);

                // Concatenate the new buffer with the output buffer
                strcat(lpszOutPut, lpszData);

                // Delete the holding buffer
                delete [] lpszHolding;
            }
            else
            {
                // Copy the data buffer
                strcpy(lpszOutPut, lpszData);
            }

            // Allocate a holding buffer
            lpszHolding = new char[nBufferSize];

            // Copy the output buffer into the holding buffer
            memcpy(lpszHolding, lpszOutPut, nBufferSize);

            // Delete the other buffers
            delete [] lpszData;
            delete [] lpszOutPut;
        }
    } while (TRUE);

    // Close the HINTERNET handle
    InternetCloseHandle(hUrlDump);

    // Set the cursor back to an arrow
    SetCursor(LoadCursor(NULL, IDC_ARROW));

    // Return
    return TRUE;
}
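The core of the download loop is "keep reading until a successful read returns zero bytes". That pattern can be sketched without any of the buffer juggling by appending each chunk to a std::string. This is a portable illustration under stated assumptions: ReadFn is a hypothetical stand-in for a call such as InternetReadFile on an open handle, and ReadAll is a made-up helper name.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical reader: fills buffer with up to capacity bytes, stores the
// count in *bytesRead, and returns false on error. A successful call that
// reports zero bytes signals the end of the data.
using ReadFn =
    std::function<bool(char* buffer, std::size_t capacity, std::size_t* bytesRead)>;

// Appends everything the reader produces to *out.
bool ReadAll(const ReadFn& read, std::string* out)
{
    char chunk[8000];
    for (;;)
    {
        std::size_t got = 0;
        if (!read(chunk, sizeof(chunk), &got))
            return false;            // read failed; *out holds what arrived so far
        if (got == 0)
            return true;             // success with zero bytes: all data read
        out->append(chunk, got);     // explicit length, so no terminator is needed
    }
}
```

Because the chunk is appended with an explicit length, the sketch does not rely on the data being free of embedded null characters, which the strcpy-based sample does.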
© 1997 Microsoft Corporation. All rights reserved. Terms of Use.