NAME
uri - a set of functions to manipulate URIs
DESCRIPTION
The header file for the library is included with #include <uri.h>
and the library may be linked using -luri.
uri is a library that analyses URIs and transforms them. It is
designed to be fast and to use as little memory as possible. The basic
usage of this library is to transform a URI into a structure with
one field for each component of the URI, and vice versa.
LIBRARY MODE
The library behaviour is controlled by the flags described below.
The default set of flags is
URI_MODE_CANNONICAL|URI_MODE_ERROR_STDERR.
URI_MODE_CANNONICAL
    All objects store the URI in canonical form.
URI_MODE_LOWER_SCHEME
    The scheme of the URI is always converted to lower case.
URI_MODE_ERROR_STDERR
    If an error occurs, the error string is printed on the stderr
    channel.
URI_MODE_FIELD_MALLOC
    Each field may have its own malloc'd space. When the caller sets
    a field, it can assume the content of the field is saved in the
    object. If this flag is not set, a caller that sets a field must
    make sure that the memory containing the value of the field is
    not freed before the object is deallocated.
URI_MODE_FURI_MD5
    Use an MD5 key calculated from the URL as a path name instead of
    the readable path name described in the FURI section below. For
    example http://www.foo.com/ is transformed into the MD5 key
    33024cec6160eafbd2717e394b5bc201 and the corresponding FURI is
    33/02/4c/ec6160eafbd2717e394b5bc201.
URI_MODE_URI_STRICT
    Behave in strict mode (see STRICTNESS below).
URI_MODE_URI_STRICT_SCHEME
    Behave in strict mode (see STRICTNESS below).
URI_MODE_FLAG_DEFAULT
    The default mode of the library.
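The directory layout used by URI_MODE_FURI_MD5 can be illustrated with a minimal sketch. It only shows how a 32-character hexadecimal key is split into the xx/yy/zz/rest form of the example above; it does not compute the MD5 key, and md5_key_to_furi is a hypothetical name that is not part of the uri library.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the uri library): split a
 * 32-character hexadecimal MD5 key into the layout
 * xx/yy/zz/remaining-26-characters.
 * Returns 0 on success, -1 if the key does not have 32 characters
 * or the output buffer is too small.
 */
int md5_key_to_furi(const char* key, char* out, size_t out_size)
{
    if (key == NULL || out == NULL || strlen(key) != 32)
        return -1;
    /* 32 key characters + 3 slashes + terminating NUL = 36 bytes. */
    if (out_size < 36)
        return -1;
    snprintf(out, out_size, "%.2s/%.2s/%.2s/%s",
             key, key + 2, key + 4, key + 6);
    return 0;
}
```

Called with the key 33024cec6160eafbd2717e394b5bc201, this sketch writes 33/02/4c/ec6160eafbd2717e394b5bc201 into out.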
STRUCTURE AND ALLOCATION
The uri_t type is a structure describing the URI. Access functions
are provided and should be used to get the values of the fields and
to set new values. All the fields are character strings whose size
is exactly the size of the string they contain. One can safely
overwrite the values contained in the fields, as long as the
replacement string is no longer than the original. If the
replacement string is longer, the caller must use a buffer of its
own.
If the flag URI_MODE_FIELD_MALLOC is not set, which is the default,
the allocation policy for a uri_t object is minimal. When an object
is allocated using uri_alloc, memory is allocated by the library to
store the object. This memory is released when the object is freed
using uri_free. When a field is set, the pointer is stored in the
object and no copy of the string is kept. It is the responsibility
of the caller to make sure that the string lives as long as the
object does.
This policy is designed to avoid allocation as much as possible.
Suppose a program operates on 50 000 URLs: only one malloc and a
few reallocs are necessary, instead of 50 000 malloc/free pairs
multiplied by the number of fields of the structure. The loop looks
like this:
    /*
     * Alloc an empty object.
     */
    uri_t* uri = uri_alloc_1();
    for(i = 0; i < 50000; i++) {
        /*
         * Reuse the object for another url; the object grows
         * only if needed because the url is larger than
         * any previously seen url.
         */
        uri_realloc(uri, url[i], strlen(url[i]));
        /* ... do something on uri ... */
        /*
         * Print the url on stdout
         */
        printf("%s\n", uri_uri(uri));
    }
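The grow-only reuse described above can be sketched independently of the library. The following is a minimal illustration, assuming a single storage area that is reallocated only when a string longer than any previous one arrives; reuse_buf_t, reuse_buf_set and reuse_buf_free are hypothetical names, and the real layout of uri_t is not shown in this page.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical grow-only buffer, not the real uri_t layout. */
typedef struct {
    char*  data;        /* storage reused across calls */
    size_t capacity;    /* bytes currently allocated */
    int    grow_count;  /* number of times the storage had to grow */
} reuse_buf_t;

/*
 * Copy `length` bytes of `s` into the buffer, growing the storage
 * only if it is too small. Returns 0 on success, -1 if the
 * allocation fails.
 */
int reuse_buf_set(reuse_buf_t* buf, const char* s, size_t length)
{
    if (length + 1 > buf->capacity) {
        char* grown = (char*)realloc(buf->data, length + 1);
        if (grown == NULL)
            return -1;
        buf->data = grown;
        buf->capacity = length + 1;
        buf->grow_count++;
    }
    memcpy(buf->data, s, length);
    buf->data[length] = '\0';
    return 0;
}

/* Release the storage and reset the buffer to its empty state. */
void reuse_buf_free(reuse_buf_t* buf)
{
    free(buf->data);
    buf->data = NULL;
    buf->capacity = 0;
}
```

With this pattern the number of allocations depends on how often a new maximum length is seen, not on the number of strings processed.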
If the flag URI_MODE_FIELD_MALLOC is set, each field has a
separately allocated space, if necessary. The caller may assume that
the object is always self-contained and does not depend on
externally allocated strings. Each set function (uri_scheme_set,
uri_host_set etc.) allocates the necessary space and duplicates the
string given as argument. The info field contains flags that record
which fields point to malloc'd space and which do not
(URI_INFO_M_* flags). This information is only valid between two
calls of the library functions. For instance uri_cannonicalize will
reorganize allocated space. This policy is used for integration of
the library into scripting languages such as Perl.
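The duplicate-on-set policy can be sketched as follows. This is only an illustration of the idea, assuming one field and one flag bit; sketch_uri_t, SKETCH_INFO_M_HOST, sketch_host_set and sketch_clear are hypothetical, and the real URI_INFO_M_* values are defined by the library.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical flag bit; the real URI_INFO_M_* values are not shown here. */
#define SKETCH_INFO_M_HOST 0x01

/* Hypothetical one-field object, not the real uri_t layout. */
typedef struct {
    int   info;  /* SKETCH_INFO_M_HOST set when host is malloc'd */
    char* host;
} sketch_uri_t;

/*
 * Store a private copy of `value` in the object and release the
 * previous copy, if any. Returns 0 on success, -1 if malloc fails.
 */
int sketch_host_set(sketch_uri_t* object, const char* value)
{
    size_t size = strlen(value) + 1;
    char* copy = (char*)malloc(size);
    if (copy == NULL)
        return -1;
    /* Copy before freeing, so `value` may alias the old field. */
    memcpy(copy, value, size);
    if (object->info & SKETCH_INFO_M_HOST)
        free(object->host);
    object->host = copy;
    object->info |= SKETCH_INFO_M_HOST;
    return 0;
}

/* Free the field only if the flag says the object owns it. */
void sketch_clear(sketch_uri_t* object)
{
    if (object->info & SKETCH_INFO_M_HOST)
        free(object->host);
    object->host = NULL;
    object->info &= ~SKETCH_INFO_M_HOST;
}
```

Because the object owns its copy, the caller may reuse or free the buffer passed to the set function immediately after the call.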
Each bit of the info field has a corresponding define with the
following meaning.
URI_INFO_CANNONICAL
    Set if the URI is in canonical form.
URI_INFO_RELATIVE
    Set if the URI is a relative URI (does not start with
    {http,..}://).
URI_INFO_RELATIVE_PATH
    Set if the URI is a relative URI and the path does not start
    with a /.
URI_INFO_PARSED
    Set if the URI was successfully parsed. If this flag is not set,
    the content of the object is undefined.
URI_INFO_ROBOTS
    Set if the URI is an http robots.txt file.
URI_INFO_M_*
    There is one such flag for each field of the uri_t structure. If
    the flag is set, the memory pointed to by this field has been
    allocated by malloc.
The fields of the uri_t structure are:
scheme
    The scheme of the URI (http, ftp, file or news).
host
    The host name part of the URI.
port
    The port number associated to host, if any.
path
    The path name of the URI.
params
    The parameters of the URI (i.e. what is found after the ; in the
    path).
query
    The query part of a cgi-bin call (i.e. what is found after the ?
    in the path).
frag
    The fragment of the document (i.e. what is found after the # in
    the path).
user
    If authentication information is set, the user name.
passwd
    If authentication information is set, the password.
FUNCTIONS
uri_t* uri_alloc_1()
    Allocate an empty object that must be filled with the
    uri_realloc function.
uri_t* uri_alloc(char* uri, int uri_length)
    The uri is split into fields and the corresponding uri_t
    structure is returned. The structure is allocated using malloc.
    The URI is put in canonical form. If it cannot be put in
    canonical form, an error message is printed on stderr and a null
    pointer is returned.
uri_t* uri_object(char* uri, int uri_length)
    The uri is split into fields and the corresponding uri_t
    structure is returned. The returned structure is statically
    allocated and must not be freed. The URI is put in canonical
    form. If it cannot be put in canonical form, an error message is
    printed on stderr and a null pointer is returned.
int uri_realloc(uri_t* object, char* uri, int uri_length)
    The uri is split into fields in the previously allocated object
    structure. The URI is put in canonical form and URI_CANNONICAL
    is returned. If it cannot be put in canonical form, nothing is
    done and URI_NOT_CANNONICAL is returned.
void uri_free(uri_t* object)
    The object previously allocated by uri_alloc is deallocated.
uri_t* uri_abs(uri_t* base, char* relative_string, int relative_length)
    Transform the relative URI relative_string into an absolute URI
    using base as the base URI. The returned uri_t object is
    allocated statically and must not be freed.
uri_t* uri_abs_1(uri_t* base, uri_t* relative)
    Transform the relative URI relative into an absolute URI using
    base as the base URI. The returned uri_t object is allocated
    statically and must not be freed.
int uri_info(uri_t* object)
    Returns the content of the info field.
char* uri_scheme(uri_t* object)
    Returns the content of the scheme field.
char* uri_host(uri_t* object)
    Returns the content of the host field.
char* uri_port(uri_t* object)
    Returns the value of the port field of the object. If the port
    field is empty, returns the default port for the corresponding
    scheme. For instance, if the scheme is http, the string 80 is
    returned. The returned string is statically allocated and must
    not be freed.
char* uri_path(uri_t* object)
    Returns the content of the path field.
char* uri_params(uri_t* object)
    Returns the content of the params field.
char* uri_query(uri_t* object)
    Returns the content of the query field.
char* uri_frag(uri_t* object)
    Returns the content of the frag field.
char* uri_user(uri_t* object)
    Returns the content of the user field.
char* uri_passwd(uri_t* object)
    Returns the content of the passwd field.
char* uri_netloc(uri_t* object)
    Returns a concatenation of the host and port fields, separated
    by a :. If the host field is not set, the null pointer is
    returned and a message is printed on stderr. The returned string
    is statically allocated and must not be freed.
char* uri_auth_netloc(uri_t* object)
    Returns a concatenation of the host and port fields, separated
    by a :. If the user field is set, the user and passwd fields are
    prepended to the netloc, separated by a @. If the host field is
    not set, the null pointer is returned and the error condition is
    set. The returned string is statically allocated and must not be
    freed.
char* uri_auth(uri_t* object)
    Returns a concatenation of the user and passwd fields, separated
    by a :, or an empty string if either of them is not set. The
    returned string is statically allocated and must not be freed.
char* uri_all_path(uri_t* object)
    Returns a concatenation of the path, params and query fields in
    the form /path;params?query. Note that a leading slash is only
    prepended to the returned value if the object is not a relative
    URI. The returned string is statically allocated and must not be
    freed.
void uri_info_set(uri_t* object, int value)
    Set the info field to value.
void uri_scheme_set(uri_t* object, char* value)
    Set the scheme field to value. The URI_INFO_RELATIVE flag is
    updated according to the new value.
void uri_host_set(uri_t* object, char* value)
    Set the host field to value. The URI_INFO_RELATIVE flag is
    updated according to the new value.
void uri_params_set(uri_t* object, char* value)
    Set the params field to value.
void uri_query_set(uri_t* object, char* value)
    Set the query field to value.
void uri_user_set(uri_t* object, char* value)
    Set the user field to value.
void uri_passwd_set(uri_t* object, char* value)
    Set the passwd field to value.
void uri_copy(uri_t* to, uri_t* from)
    Copy the content of object from into object to.
uri_t* uri_clone(uri_t* from)
    Creates a new object containing the same data as from. The
    returned object must be freed using uri_free.
void uri_clear(uri_t* object)
    Clear all information contained in object.
void uri_set_root(const char* root)
    Set the path that uri_furi will prepend to the FURI. By default
    it is the empty string.
const char* uri_get_root()
    Get the path set by uri_set_root, or the empty string.
char* uri_furi(uri_t* object)
    Returns a string containing the FURI (File equivalent of a URI)
    built from object. The returned string is statically allocated
    and must not be freed.
char* uri_uri(uri_t* object)
    Returns a string containing the URI built from object. The
    returned string is statically allocated and must not be freed.
void uri_string(uri_t* object, char** stringp, int* string_sizep, int flags)
    Build a string representation of object in *stringp according to
    flags. Possible values of flags are described under the
    uri_cannonicalize_string function. Upon return, *stringp points
    to a buffer of *string_sizep bytes allocated with malloc. If
    *stringp is not null on entry, it must point to a buffer
    allocated with malloc, which is reallocated to fit the needs of
    the string conversion. This function is the backend of all
    object to string translation functions.
char* uri_escape(char* string, char* range)
    Returns a statically allocated copy of string with all
    characters found in the range string transformed to escaped form
    (%xx). A few values for the range argument are predefined:
    URI_ESCAPE_RESERVED, URI_ESCAPE_PATH, URI_ESCAPE_QUERY, and
    uri_escape_unsafe.
char* uri_unescape(char* string)
    Returns a statically allocated copy of string with all escape
    sequences (%xx) transformed to characters.
char* uri_cannonicalize_string(char* uri, int uri_length, int flag)
    Returns the canonical form of the uri given as argument. The
    canonical form is formatted according to the value of flag.
    Values of flag are bits that can be ored together:
        URI_STRING_FURI_STYLE        return a FURI
        URI_STRING_URI_STYLE         return a URI
        URI_STRING_ROBOTS_STYLE      return the corresponding
                                     robots.txt URI
        URI_STRING_URI_NOHASH_STYLE  do not include the frag in the
                                     returned string
    Returns 0 if uri is malformed.
uri_t* uri_cannonical(uri_t* object)
    Returns an object containing the canonical form of object. If
    the URI_MODE_CANNONICAL flag is set, the object itself is
    returned.
int uri_consistent(uri_t* object)
    Returns 0 if object contains an unparsable URL, and a non-zero
    value if object contains a well-formed URL. Must be called after
    a set of field changes to reset the flags and ensure that the
    modified URL is well formed.
HTTP FUNCTIONS
char* uri_robots(uri_t* object)
    Returns a string containing the URI of the robots.txt file
    corresponding to the URI contained in object. For instance, if
    the URI contained in object is
    http://www.foo.com/dir/dir/file.html the returned string will be
    http://www.foo.com/robots.txt. The returned string is statically
    allocated and must not be freed.
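The mapping from a URI to its robots.txt URI can be sketched on plain strings. This is a minimal illustration that assumes an absolute URI containing ://, and that keeps everything up to the end of the netloc; sketch_robots is a hypothetical name and does not reproduce the library's parsing or its static buffer.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write "<scheme>://<netloc>/robots.txt" for
 * the absolute URI `uri` into `out`.
 * Returns 0 on success, -1 if `uri` has no "://" or `out` is too
 * small.
 */
int sketch_robots(const char* uri, char* out, size_t out_size)
{
    const char* netloc;
    size_t prefix_len;
    int written;
    const char* sep = strstr(uri, "://");

    if (sep == NULL)
        return -1;
    netloc = sep + 3;
    /* The netloc ends at the first '/', '?' or '#', or at the end. */
    prefix_len = (size_t)(netloc - uri) + strcspn(netloc, "/?#");
    written = snprintf(out, out_size, "%.*s/robots.txt",
                       (int)prefix_len, uri);
    return (written < 0 || (size_t)written >= out_size) ? -1 : 0;
}
```

For http://www.foo.com/dir/dir/file.html this sketch produces http://www.foo.com/robots.txt, matching the example above.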
CANONICAL FORM
The canonical form of a URI is an arbitrary choice that codes all
the possible variations of the same URI in one string. For instance
http://www.foo.com/abc"def.html will be transformed to
http://www.foo.com/abc%22def.html. Most of the transformations
follow the instructions found in draft-fielding-uri-syntax-04
but some of them don't.
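The %xx transformation in the example above can be sketched as a small function in the spirit of uri_escape. This is an assumption-laden illustration: sketch_escape is a hypothetical name, it writes into a caller-supplied buffer instead of a static one, and the set of characters the library actually escapes during canonicalization is not specified here.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy `string` into `out`, replacing every
 * character that appears in `range` by its %xx escape (lower case
 * hexadecimal digits).
 * Returns 0 on success, -1 if `out` is too small.
 */
int sketch_escape(const char* string, const char* range,
                  char* out, size_t out_size)
{
    static const char hex[] = "0123456789abcdef";
    size_t used = 0;

    for (; *string != '\0'; string++) {
        unsigned char c = (unsigned char)*string;
        if (strchr(range, c) != NULL) {
            /* Three bytes for %xx, plus room for the final NUL. */
            if (used + 3 >= out_size)
                return -1;
            out[used++] = '%';
            out[used++] = hex[c >> 4];
            out[used++] = hex[c & 0x0f];
        } else {
            if (used + 1 >= out_size)
                return -1;
            out[used++] = (char)c;
        }
    }
    if (used >= out_size)
        return -1;
    out[used] = '\0';
    return 0;
}
```

With a range containing only the double quote, http://www.foo.com/abc"def.html becomes http://www.foo.com/abc%22def.html.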
Additionally, when the path of the URI contains dots and double
dots, it is reduced. For instance
http://www.foo.com/dir/.././file.html will be transformed to
http://www.foo.com/file.html.
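The reduction of dots and double dots can be sketched on the path alone. The following is a minimal illustration, not the library's implementation: sketch_reduce_path is a hypothetical name, it only accepts paths starting with /, and as a simplification it also drops empty segments (//), which the library may treat differently.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: reduce "." and ".." segments of an absolute
 * path. A ".." above the root is silently dropped.
 * Returns 0 on success, -1 if `path` is not absolute or `out` is
 * too small.
 */
int sketch_reduce_path(const char* path, char* out, size_t out_size)
{
    size_t used = 0;
    const char* p = path;
    int trailing_slash = 0;

    if (path[0] != '/' || out_size < 2)
        return -1;
    while (*p == '/') {
        size_t len;
        p++;                        /* skip the separator */
        len = strcspn(p, "/");      /* length of the next segment */
        if (len == 0 || (len == 1 && p[0] == '.')) {
            trailing_slash = 1;     /* empty or "." segment: skip */
        } else if (len == 2 && p[0] == '.' && p[1] == '.') {
            /* Remove the last segment already written, if any. */
            while (used > 0 && out[--used] != '/')
                ;
            trailing_slash = 1;
        } else {
            if (used + 1 + len >= out_size)
                return -1;
            out[used++] = '/';
            memcpy(out + used, p, len);
            used += len;
            trailing_slash = 0;
        }
        p += len;
    }
    /* Keep a final slash when the path ended on a directory. */
    if (trailing_slash || used == 0) {
        if (used + 1 >= out_size)
            return -1;
        out[used++] = '/';
    }
    out[used] = '\0';
    return 0;
}
```

Applied to /dir/.././file.html, the sketch yields /file.html, as in the example above.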
If the URI_MODE_CANNONICAL flag is set, the uri_t object always
contains the canonical form of the URL. The original form is lost.
If the URI_MODE_CANNONICAL flag is not set, the canonical form of
the URI is stored in a separate object. The uri_t object contains
the original form of the URI. This takes more memory but may be
useful in some situations.
ERROR HANDLING
When an error occurs (the URI cannot be canonicalized or parsed, for
instance), the global variable uri_errstr contains the full text of
the error message. The library functions never reset this variable
when no error occurs, so it may still hold the message of an
earlier error.
Additionally, the error string is printed on the error channel
(stderr) if the URI_MODE_ERROR_STDERR flag is set. This is the
default.
STRICTNESS
The draft describing URI syntax (draft-fielding-uri-syntax-04)
specifies that a URI of the type http:g may be interpreted in two
different ways. If the URI_MODE_URI_STRICT flag is set, the library
interprets it as an absolute URI, otherwise it is a relative URI.
If the URI_MODE_URI_STRICT flag is not set, the
URI_MODE_URI_STRICT_SCHEME flag may be set so that a relative URI
containing a scheme is interpreted as an absolute URI only if the
scheme is different from the scheme of the base URI.
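The decision described above can be written down as a small predicate. This is a sketch under stated assumptions: SKETCH_STRICT and SKETCH_STRICT_SCHEME are hypothetical stand-ins for the real flag values defined in uri.h, sketch_is_absolute is not a library function, and schemes are compared exactly (no case folding).

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag values standing in for URI_MODE_URI_STRICT and
 * URI_MODE_URI_STRICT_SCHEME. */
#define SKETCH_STRICT        0x1
#define SKETCH_STRICT_SCHEME 0x2

/* Length of the leading scheme in "scheme:rest", or 0 if none. */
static size_t sketch_scheme_length(const char* uri)
{
    size_t i = 0;
    if (!isalpha((unsigned char)uri[0]))
        return 0;
    while (isalnum((unsigned char)uri[i]) || uri[i] == '+' ||
           uri[i] == '-' || uri[i] == '.')
        i++;
    return uri[i] == ':' ? i : 0;
}

/*
 * Return 1 if `reference` (for instance "http:g") is to be treated
 * as an absolute URI, 0 if it is relative to a base URI whose
 * scheme is `base_scheme`.
 */
int sketch_is_absolute(const char* reference, const char* base_scheme,
                       int mode)
{
    size_t len = sketch_scheme_length(reference);

    if (len == 0)
        return 0;               /* no scheme: relative */
    if (strncmp(reference + len, "://", 3) == 0)
        return 1;               /* scheme://...: absolute */
    if (mode & SKETCH_STRICT)
        return 1;               /* strict: a scheme means absolute */
    if (mode & SKETCH_STRICT_SCHEME)
        /* absolute only if the scheme differs from the base scheme */
        return !(strlen(base_scheme) == len &&
                 strncmp(reference, base_scheme, len) == 0);
    return 0;                   /* otherwise: relative */
}
```

With a base scheme of http, the sketch treats http:g as absolute only in strict mode, while ftp:g is also absolute when only the strict scheme flag is set.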
FURI
It is sometimes convenient to convert a URI into a path name. Some
functions of the uri library provide such a conversion (uri_furi for
instance). These path names are called FURI (File equivalent of a
URI) for short. Here is a description of the transformation.
    URI:   http://www.ina.fr:700/imagina/index.html#queau
    FURI:  http/www.ina.fr:700/imagina/index.html
The :// after the scheme becomes a single /, the host, port and path
are kept as they are, and the fragment (#queau) is lost.
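The readable FURI transformation shown above can be sketched on plain strings. This illustration covers only what the example shows (the :// separator and the fragment); sketch_furi is a hypothetical name, and the root set by uri_set_root as well as the URI_MODE_FURI_MD5 variant are left out.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: turn "scheme://rest#frag" into "scheme/rest".
 * Returns 0 on success, -1 if `uri` has no "://" or `out` is too
 * small.
 */
int sketch_furi(const char* uri, char* out, size_t out_size)
{
    const char* rest;
    size_t rest_len;
    int written;
    const char* sep = strstr(uri, "://");

    if (sep == NULL)
        return -1;
    rest = sep + 3;
    rest_len = strcspn(rest, "#");      /* drop the fragment */
    written = snprintf(out, out_size, "%.*s/%.*s",
                       (int)(sep - uri), uri, (int)rest_len, rest);
    return (written < 0 || (size_t)written >= out_size) ? -1 : 0;
}
```

For http://www.ina.fr:700/imagina/index.html#queau the sketch writes http/www.ina.fr:700/imagina/index.html.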
EXAMPLES
Show the canonical form of a URI
    char* uri = "http://www.foo.com/";
    uri = uri_cannonicalize_string(uri, strlen(uri), URI_STRING_URI_STYLE);
    if(uri) printf("uri = %s\n", uri);
Show the host and port of a URI (netloc)
    char* uri = "http://www.foo.com:7000/";
    /* The variable must not be named uri_object, or it would hide the function. */
    uri_t* object = uri_object(uri, strlen(uri));
    if(object) printf("netloc = %s\n", uri_netloc(object));
Change the query part of a URI and show it
    char* uri = "http://www.foo.com/cgi-bin/bar?param=1";
    uri_t* object = uri_object(uri, strlen(uri));
    if(object) {
        uri_query_set(object, "param=2");
        printf("uri = %s\n", uri_uri(object));
    }
ADDING NEW SCHEMES
Add the name of the scheme to the SCHEMES file. If nothing else is
done, this binds the scheme to a generic parser following the URI
parsing rules. To define specific behaviour for this scheme, mimic
the uri_scheme_http.c file and recompile. If gperf(1) complains
because it has conflicts, you will have to play with the -k option
in order to find a working range that does not conflict and takes
as little space as possible.
AUTHOR
Loic Dachary
loic@senga.org
SEE ALSO
draft-fielding-uri-syntax-04