A Uniform Resource Locator (URL) is a compact representation of the location and access method for a resource located on the Internet. Each URL consists of a scheme (HTTP, HTTPS, FTP, or Gopher) and a scheme-specific string. This string can also include a combination of a directory path, search string, or name of the resource. The Win32® Internet functions provide the ability to create, combine, break down, and canonicalize URLs. For more information on URLs, see RFC 1738, Uniform Resource Locators (URL). This document can be found at ftp://ftp.isi.edu/in-notes/rfc1738.txt .
The URL functions operate in a task-oriented manner. They do not verify the content or format of the URL that they are given. The calling application should track its use of these functions to ensure that the data is in the intended format. For example, when no flags are specified, the InternetCanonicalizeUrl function converts the character "%" into the escape sequence "%25". If InternetCanonicalizeUrl is then called again on the canonicalized URL, the escape sequence "%25" is converted into "%2525", which no longer represents the original character.
The format of all URLs must follow the accepted syntax and semantics in order to access resources through the Internet. Canonicalization is the process of formatting a URL to follow this accepted syntax and semantics.
Characters that must be encoded include any characters that have no corresponding graphic character in the US-ASCII coded character set (hexadecimal 80-FF, which are not used in the US-ASCII coded character set, and hexadecimal 00-1F and 7F, which are control characters), blank spaces, "%" (which is used to encode other characters), and unsafe characters (<, >, ", #, {, }, |, \, ^, ~, [, ], and ').
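The character classes listed above, and the double-encoding pitfall described earlier, can be modeled in a few lines of portable C++. This is only a sketch: must_encode and encode_once are hypothetical helpers written for this illustration, not WinINet functions, and they do not reproduce everything that InternetCanonicalizeUrl does.

```cpp
#include <string>

// Hypothetical helper (not part of WinINet): returns true for the characters
// that the text above lists as requiring encoding.
bool must_encode(unsigned char c)
{
    if (c >= 0x80) return true;                // not in the US-ASCII character set
    if (c <= 0x1F || c == 0x7F) return true;   // control characters
    if (c == ' ' || c == '%') return true;     // blank space and the escape character
    const std::string unsafe = "<>\"#{}|\\^~[]'";
    return unsafe.find(static_cast<char>(c)) != std::string::npos;
}

// Hypothetical helper: replaces each such character with a %XX escape sequence.
// It does not check whether the input has already been encoded.
std::string encode_once(const std::string& in)
{
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : in) {
        if (must_encode(c)) {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        } else {
            out += static_cast<char>(c);
        }
    }
    return out;
}
```

With these helpers, encode_once("100%") yields "100%25", and applying encode_once to that result yields "100%2525". This is why an application should keep track of whether a URL has already been canonicalized.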
The following table summarizes the URL functions included with the Win32 Internet functions.
Function | Description |
InternetCanonicalizeUrl | Canonicalizes the URL. |
InternetCombineUrl | Combines base and relative URLs. |
InternetCrackUrl | Parses a URL string into components. |
InternetCreateUrl | Creates a URL string from components. |
InternetOpenUrl | Begins retrieving an FTP, Gopher, HTTP, or HTTPS resource. |
Canonicalizing a URL is the process that converts a URL (that might contain unsafe characters such as blank spaces, reserved characters, and so on) into an accepted format.
The InternetCanonicalizeUrl function can be used to canonicalize URLs. This function is very task-oriented, so the application should track its use carefully. InternetCanonicalizeUrl does not verify that the URL passed to it is already canonicalized, nor that the URL it returns is valid.
The following five flags control how InternetCanonicalizeUrl handles a particular URL. The flags can be used in combination. If no flags are used, the function encodes the URL by default.
Value | Meaning |
ICU_BROWSER_MODE | Do not encode or decode characters after "#" or "?", and do not remove trailing white space after "?". If this value is not specified, the entire URL is encoded, and trailing white space is removed. |
ICU_DECODE | Convert all %XX sequences to characters, including escape sequences, before the URL is parsed. |
ICU_ENCODE_SPACES_ONLY | Encode spaces only. |
ICU_NO_ENCODE | Do not convert unsafe characters to escape sequences. |
ICU_NO_META | Do not remove meta sequences (such as "." and "..") from the URL. |
The ICU_DECODE flag should be used only on canonicalized URLs, because it assumes that all %XX sequences are escape codes and converts them into the characters indicated by the code. If the URL has a "%" symbol in it that is not part of an escape code, ICU_DECODE still treats it as one. This characteristic might cause InternetCanonicalizeUrl to create an invalid URL.
To use InternetCanonicalizeUrl to return a completely decoded URL, the ICU_DECODE and ICU_NO_ENCODE flags must be specified. This setup assumes that the URL being passed to InternetCanonicalizeUrl has been previously canonicalized.
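The risk of decoding a URL that was never canonicalized can be seen in a small standalone model. The decode_once function below is a hypothetical helper, not a WinINet call. It treats every "%" followed by two hexadecimal digits as an escape sequence, which is the assumption that ICU_DECODE is described as making. Leaving a malformed sequence such as "%zz" unchanged is a choice made for this sketch only.

```cpp
#include <cstddef>
#include <string>

// Returns the numeric value of a hexadecimal digit, or -1 if c is not one.
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: converts every %XX sequence into the character it
// names. It cannot tell a real escape sequence from a literal "%" that
// happens to be followed by two hexadecimal digits.
std::string decode_once(const std::string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '%' && i + 2 < in.size()) {
            int hi = hex_value(in[i + 1]);
            int lo = hex_value(in[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out += static_cast<char>(hi * 16 + lo);
                i += 2;          // skip the two digits that were consumed
                continue;
            }
        }
        out += in[i];            // ordinary character or malformed sequence
    }
    return out;
}
```

A canonicalized string such as "a%20b%2525" decodes to "a b%25" as intended. In the uncanonicalized string "10%ABV", however, the sequence "%AB" is consumed as if it were an escape code and becomes the single byte 0xAB.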
A relative URL is a compact representation of the location of a resource relative to an absolute base URL. The base URL must be known to the parser and usually includes the scheme, network location, and parts of the URL path. An application can call InternetCombineUrl to combine the relative URL with its base URL. InternetCombineUrl will also canonicalize the resultant URL.
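As a rough illustration of what combining involves, the sketch below resolves a plain relative path against an absolute base URL and removes the "." and ".." meta sequences. The function combine_simple is hypothetical and deliberately simplified. It assumes that the base has the form scheme://host/path and that the relative URL has no scheme, no leading "/", and no query string or fragment. It also collapses empty path segments. It is not a reproduction of InternetCombineUrl.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified model of combining a base and a relative URL.
std::string combine_simple(const std::string& base, const std::string& relative)
{
    const std::size_t npos = std::string::npos;
    std::size_t schemeEnd = base.find("://");
    if (schemeEnd == npos) return std::string();      // base is not absolute

    // Split the base into "scheme://host" and its path.
    std::size_t pathStart = base.find('/', schemeEnd + 3);
    std::string prefix = (pathStart == npos) ? base : base.substr(0, pathStart);
    std::string basePath = (pathStart == npos) ? std::string("/") : base.substr(pathStart);

    // Drop the last segment of the base path and append the relative path.
    std::string merged = basePath.substr(0, basePath.rfind('/') + 1) + relative;

    // Resolve the "." and ".." meta sequences segment by segment.
    std::vector<std::string> kept;
    bool endsWithSlash = false;
    for (std::size_t i = 1; i <= merged.size(); ) {
        std::size_t j = merged.find('/', i);
        if (j == npos) j = merged.size();
        std::string segment = merged.substr(i, j - i);
        bool isLast = (j == merged.size());
        if (segment == "..") {
            if (!kept.empty()) kept.pop_back();       // step up one directory
        } else if (!segment.empty() && segment != ".") {
            kept.push_back(segment);
        }
        if (isLast)
            endsWithSlash = (segment.empty() || segment == "." || segment == "..");
        i = j + 1;
    }

    // Rebuild the path from the segments that remain.
    std::string path;
    for (const std::string& s : kept) path += "/" + s;
    if (endsWithSlash || kept.empty()) path += "/";
    return prefix + path;
}
```

For example, combining the base "http://www.microsoft.com/windows/feature/default.htm" with the relative URL "../images/logo.gif" gives "http://www.microsoft.com/windows/images/logo.gif" under these assumptions.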
The InternetCrackUrl function separates a URL into its component parts and returns the components indicated by the URL_COMPONENTS structure that is passed to the function.
The components that make up the URL_COMPONENTS structure are the scheme number, host name, port number, user name, password, URL path, and additional information (such as search parameters). Each component, except the scheme and port numbers, has a string member that holds the information, and a member that holds the length of the string member. The scheme and port numbers have only a member that stores the corresponding value; they are both returned on all successful calls to InternetCrackUrl.
To get the value of a particular component in the URL_COMPONENTS structure, the member that stores the string length of that component must be set to a nonzero value. The string member can be either the address of a buffer or NULL.
If the pointer member contains the address of a buffer, the string length member must contain the size of that buffer. InternetCrackUrl returns the component information as a string in the buffer and stores the string length in the string length member.
If the pointer member is set to NULL, the string length member can be set to any nonzero value. InternetCrackUrl stores the address of the first character of the URL string that contains the component information and sets the string length to the number of characters in the remaining part of the URL string that pertains to the component.
All pointer members set to NULL with a nonzero length member point to the appropriate starting point in the URL string. The length stored in the length member must be used to determine the end of the individual component's information.
To finish initializing the URL_COMPONENTS structure properly, the dwStructSize member must be set to the size of the URL_COMPONENTS structure.
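The pointer-plus-length convention described above can be modeled without WinINet. In the sketch below, ComponentView stands in for one pointer member and its length member, and split_simple_url is a hypothetical, much simplified splitter that handles only the form scheme://host/path. Neither is part of the WinINet API. They only show that a component returned this way is not zero-terminated and must be read by using its length.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for one pointer/length pair of URL_COMPONENTS.
struct ComponentView {
    const char* start;    // points into the original URL string
    std::size_t length;   // number of characters that belong to the component
};

// Copies exactly `length` characters; the URL string itself is not modified.
std::string copy_component(const ComponentView& v)
{
    return v.start ? std::string(v.start, v.length) : std::string();
}

// Hypothetical, simplified splitter for "scheme://host/path". Each view
// points into `url`; no component is copied or zero-terminated.
bool split_simple_url(const char* url, ComponentView& scheme,
                      ComponentView& host, ComponentView& path)
{
    const char* sep = std::strstr(url, "://");
    if (!sep) return false;
    const char* end = url + std::strlen(url);
    const char* hostStart = sep + 3;
    const char* slash = std::strchr(hostStart, '/');
    if (!slash) slash = end;                          // URL has no path

    scheme = { url, static_cast<std::size_t>(sep - url) };
    host   = { hostStart, static_cast<std::size_t>(slash - hostStart) };
    path   = { slash, static_cast<std::size_t>(end - slash) };
    return true;
}
```

Copying the component out with its length, as copy_component does, is an alternative to temporarily writing a null character into the URL string. It also works when the URL string is read-only.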
The following example cracks the URL entered in the edit box, IDC_PreOpen1, and displays its components in the list box, IDC_PreOpenList. Because each component is returned as a pointer into the URL string rather than as a separate zero-terminated string, the sample displays an individual component by saving the character immediately after the component, temporarily replacing it with a null character, and then restoring it.
#include <windows.h>
#include <wininet.h>
#include <stdio.h>
#include <string.h>

// This sample assumes an ANSI (non-UNICODE) build. ErrorOut and the
// IDC_PreOpen1 and IDC_PreOpenList control IDs are defined elsewhere
// in the application.

// Adds "label: component" to the list box. The component is a pointer into
// the URL string plus a length, so the character just past the component is
// temporarily replaced with a null character and then restored.
static void ShowComponent(HWND hX, LPCSTR lpszLabel, LPSTR lpszComponent,
                          DWORD dwLength)
{
    char TempOut[256];
    char tempChar;

    if (lpszComponent == NULL || dwLength == 0)
        return;

    tempChar = lpszComponent[dwLength];
    lpszComponent[dwLength] = '\0';
    sprintf(TempOut, "%s: %s", lpszLabel, lpszComponent);
    SendDlgItemMessage(hX, IDC_PreOpenList, LB_ADDSTRING, 0, (LPARAM)TempOut);
    lpszComponent[dwLength] = tempChar;
}

int WINAPI Cracker(HWND hX)
{
    URL_COMPONENTS urlcmpTheUrl;
    int intTestSize = 80;
    LPSTR lpszUrlIn;
    LPURL_COMPONENTS lpUrlComp = &urlcmpTheUrl;
    char TempOut[256];

    lpszUrlIn = new char[intTestSize];
    GetDlgItemText(hX, IDC_PreOpen1, lpszUrlIn, intTestSize);
    SendDlgItemMessage(hX, IDC_PreOpenList, LB_RESETCONTENT, 0, 0);

    urlcmpTheUrl.dwStructSize = sizeof(urlcmpTheUrl);

    // NULL pointers ask InternetCrackUrl to return pointers into lpszUrlIn.
    urlcmpTheUrl.lpszScheme = NULL;
    urlcmpTheUrl.lpszHostName = NULL;
    urlcmpTheUrl.lpszUserName = NULL;
    urlcmpTheUrl.lpszPassword = NULL;
    urlcmpTheUrl.lpszUrlPath = NULL;
    urlcmpTheUrl.lpszExtraInfo = NULL;

    /* The following lines set which components will be displayed. */
    urlcmpTheUrl.dwSchemeLength = 1;
    urlcmpTheUrl.dwHostNameLength = 1;
    urlcmpTheUrl.dwUserNameLength = 1;
    urlcmpTheUrl.dwPasswordLength = 1;
    urlcmpTheUrl.dwUrlPathLength = 1;
    urlcmpTheUrl.dwExtraInfoLength = 1;

    if (!InternetCrackUrl(lpszUrlIn, (DWORD)strlen(lpszUrlIn), 0, lpUrlComp))
    {
        ErrorOut(hX, GetLastError(), "Cracker");
        delete [] lpszUrlIn;    // Free the URL buffer on the error path too.
        return FALSE;
    }

    ShowComponent(hX, "Scheme", urlcmpTheUrl.lpszScheme,
                  urlcmpTheUrl.dwSchemeLength);

    sprintf(TempOut, "Scheme number: %d", urlcmpTheUrl.nScheme);
    SendDlgItemMessage(hX, IDC_PreOpenList, LB_ADDSTRING, 0, (LPARAM)TempOut);

    ShowComponent(hX, "Host Name", urlcmpTheUrl.lpszHostName,
                  urlcmpTheUrl.dwHostNameLength);

    sprintf(TempOut, "Port Number: %d", urlcmpTheUrl.nPort);
    SendDlgItemMessage(hX, IDC_PreOpenList, LB_ADDSTRING, 0, (LPARAM)TempOut);

    ShowComponent(hX, "User Name", urlcmpTheUrl.lpszUserName,
                  urlcmpTheUrl.dwUserNameLength);
    ShowComponent(hX, "Password", urlcmpTheUrl.lpszPassword,
                  urlcmpTheUrl.dwPasswordLength);
    ShowComponent(hX, "Path", urlcmpTheUrl.lpszUrlPath,
                  urlcmpTheUrl.dwUrlPathLength);
    ShowComponent(hX, "Extra", urlcmpTheUrl.lpszExtraInfo,
                  urlcmpTheUrl.dwExtraInfoLength);

    // The component pointers refer to this buffer, so free it last.
    delete [] lpszUrlIn;
    return TRUE;
}
The InternetCreateUrl function uses the information in the URL_COMPONENTS structure to create a Uniform Resource Locator.
The components that make up the URL_COMPONENTS structure are the scheme, host name, port number, user name, password, URL path, and additional information (such as search parameters). Each component, except the port number, has a string member that holds the information, and a member that holds the length of the string member.
For each required component, the pointer member should contain the address of the buffer holding the information. The length member should be set to zero if the pointer member contains the address of a zero-terminated string; the length member should be set to the string length if the pointer member contains the address of a string that is not zero-terminated. The pointer member of any components that are not required must be set to NULL.
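The three cases for a pointer member and its length member can be summarized in a small standalone sketch. The component_text function is a hypothetical helper written for this illustration. It is not part of WinINet and only mirrors the convention described above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: interprets one pointer/length pair the way the text
// above describes the input to InternetCreateUrl.
std::string component_text(const char* ptr, std::size_t length)
{
    if (ptr == nullptr) return std::string();    // component is not supplied
    if (length == 0) return std::string(ptr);    // zero-terminated string
    return std::string(ptr, length);             // counted string, no terminator needed
}
```

For example, component_text("www.microsoft.com", 0) reads the whole zero-terminated string, while component_text(p, 17) reads only the first 17 characters at p. The second form is useful when the component is part of a longer buffer.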
Gopher, FTP, and HTTP resources on the Internet can be accessed directly by using the InternetOpenUrl, InternetReadFile, and InternetFindNextFile functions. InternetOpenUrl opens a connection to the resource at the URL passed to the function. What the application does next depends on the resource: if it is a file, InternetReadFile can download it; if it is a directory, InternetFindNextFile can enumerate the files within the directory (except when using CERN proxies). For more information on InternetReadFile, see Reading Files. For more information on InternetFindNextFile, see Finding the Next File.
For applications that need to operate through a CERN proxy, InternetOpenUrl can be used to access FTP directories and files. The FTP requests are packaged to look like HTTP requests, which the CERN proxy accepts.
InternetOpenUrl takes the HINTERNET handle created by the InternetOpen function and the URL of the resource. The URL must include the scheme (http:, ftp:, gopher:, file: [for a local file], or https: [for secure HTTP]) and network location (such as www.microsoft.com). The URL can also include a path (for example, /isapi/gomscom.asp?TARGET=/windows/feature/) and resource name (for example, default.htm). For HTTP or HTTPS requests, additional headers can be included.
InternetQueryDataAvailable, InternetFindNextFile, InternetReadFile, and InternetSetFilePointer (HTTP or HTTPS URLs only) can use the handle that is created by InternetOpenUrl to download the resource.
The following diagram illustrates what handles to use with each function.
The root HINTERNET handle created by InternetOpen is used by InternetOpenUrl. The HINTERNET handle created by InternetOpenUrl can be used by InternetQueryDataAvailable, InternetReadFile, InternetFindNextFile (not shown here), and InternetSetFilePointer (HTTP or HTTPS URLs only).
For more information about HINTERNET handles and the handle hierarchy, see Appendix A: HINTERNET Handles.
The following example connects to the resource by using the InternetOpenUrl function. The sample function then uses the InternetReadFile function to download the resource. The function displays the downloaded resource in the edit box indicated by intCtrlID.
// This sample assumes an ANSI (non-UNICODE) build. hRootHandle is the
// HINTERNET handle returned by InternetOpen, and ErrorOut is defined
// elsewhere in the application.
int WINAPI UrlDump(HWND hX, int intCtrlID)
{
    HINTERNET hUrlDump;
    DWORD dwSize = 0;
    LPSTR lpszData;
    LPSTR lpszOutPut;
    LPSTR lpszHolding = NULL;   // Data read so far; NULL until the first read.
    int nCounter = 1;
    int nBufferSize;
    DWORD BigSize = 8000;

    // "server.name" is a placeholder. Replace it with a complete URL
    // that includes the scheme.
    hUrlDump = InternetOpenUrl(hRootHandle, "server.name", NULL, 0,
                               INTERNET_FLAG_RAW_DATA, 0);
    if (hUrlDump == NULL)
    {
        ErrorOut(hX, GetLastError(), "InternetOpenUrl");
        return FALSE;
    }

    do
    {
        // Allocate the buffer.
        lpszData = new char[BigSize + 1];

        // Read the data.
        if (!InternetReadFile(hUrlDump, (LPVOID)lpszData, BigSize, &dwSize))
        {
            ErrorOut(hX, GetLastError(), "InternetReadFile");
            delete [] lpszData;
            break;
        }

        // Add a null terminator to the end of the buffer.
        lpszData[dwSize] = '\0';

        // A successful read of zero bytes means that all of the data
        // has been read.
        if (dwSize == 0)
        {
            // Write the final data to the text box.
            SetDlgItemText(hX, intCtrlID, lpszHolding ? lpszHolding : "");
            delete [] lpszData;
            break;
        }

        // Determine the buffer size to hold the new data and the data
        // already read (if any).
        nBufferSize = (nCounter * BigSize) + 1;

        // Increment the number of buffers read.
        nCounter++;

        // Allocate the output buffer.
        lpszOutPut = new char[nBufferSize];

        if (lpszHolding != NULL)
        {
            // Copy the data in the holding buffer.
            strcpy(lpszOutPut, lpszHolding);

            // Concatenate the new buffer with the output buffer.
            strcat(lpszOutPut, lpszData);

            // Delete the old holding buffer.
            delete [] lpszHolding;
        }
        else
        {
            // First read: copy the data buffer.
            strcpy(lpszOutPut, lpszData);
        }

        // The output buffer becomes the new holding buffer.
        lpszHolding = lpszOutPut;

        // Delete the data buffer.
        delete [] lpszData;
    }
    while (TRUE);

    // Delete the holding buffer (deleting a NULL pointer is safe).
    delete [] lpszHolding;

    // Close the HINTERNET handle.
    InternetCloseHandle(hUrlDump);

    // Set the cursor back to an arrow.
    SetCursor(LoadCursor(NULL, IDC_ARROW));

    // Return.
    return TRUE;
}
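The read loop in this sample follows a general pattern: keep reading until a call succeeds and reports zero bytes. The portable sketch below isolates that pattern and lets std::string manage the growing buffer. ChunkReader and read_all are hypothetical names introduced here, not WinINet declarations. The callback is assumed to behave like InternetReadFile: it returns false on failure, and a successful read of zero bytes marks the end of the data.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical reader signature modeled on the InternetReadFile contract.
using ChunkReader =
    std::function<bool(char* buffer, std::size_t size, std::size_t* bytesRead)>;

// Appends everything the reader produces to `out`. Returns false if a read
// fails; `out` then holds the data that was received before the failure.
bool read_all(const ChunkReader& read, std::string& out,
              std::size_t chunkSize = 8000)
{
    std::vector<char> buffer(chunkSize);
    for (;;) {
        std::size_t got = 0;
        if (!read(buffer.data(), buffer.size(), &got))
            return false;                    // read error
        if (got == 0)
            return true;                     // success with zero bytes: done
        out.append(buffer.data(), got);      // append by length, not by terminator
    }
}
```

Because the data is appended by length, this version does not depend on a null terminator and does not stop at an embedded zero byte. In a Windows application, the callback would wrap a call to InternetReadFile on the handle returned by InternetOpenUrl.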