NAME
Lingua::JA::WebIDF - WebIDF calculator
SYNOPSIS
use Lingua::JA::WebIDF;
my $webidf = Lingua::JA::WebIDF->new(%config);
print $webidf->idf("東京"); # low
print $webidf->idf("スリジャヤワルダナプラコッテ"); # high
DESCRIPTION
Lingua::JA::WebIDF calculates WebIDF weights.
A WebIDF (Web Inverse Document Frequency) weight represents the rarity
of a word on the Web: a rare word has a high WebIDF weight, and a
common word has a low one.
IDF is based on the intuition that a query term which occurs in many
documents is not a good discriminator and should be given less weight
than one which occurs in few documents.
METHODS
new( %config || \%config )
Creates a new Lingua::JA::WebIDF instance.
The following configuration is used if you don't set %config.
KEY          DEFAULT VALUE
-----------  ---------------
idf_type     1
api          'YahooPremium'
appid        undef
driver       'TokyoCabinet'
df_file      './df.tch'
fetch_df     0
expires_in   365
documents    250_0000_0000
Furl_HTTP    undef
verbose      1
idf_type => 1 || 2 || 3
Type 1 is the most commonly cited form of IDF.
idf(t_i) = log( N / n_i )    (1)
N  : the number of documents
n_i: the number of documents that contain term t_i
t_i: term
Type 2 is a simple version of the RSJ (Robertson/Sparck Jones) weight.
idf(t_i) = log( (N - n_i + 0.5) / (n_i + 0.5) )    (2)
Type 3 is a modification of (2).
idf(t_i) = log( (N + 0.5) / (n_i + 0.5) )    (3)
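The three formulas can be compared side by side with a small sketch.
idf_weight is a hypothetical helper, not part of Lingua::JA::WebIDF, and
the natural logarithm is an assumption, since the base of the log is not
stated here. Note that (1) is undefined for n_i = 0, while (2) and (3)
stay defined because of the 0.5 terms.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's code.
# $type: formula number (1, 2, or 3)
# $N   : total number of documents
# $n   : number of documents that contain the term
sub idf_weight {
    my ($type, $N, $n) = @_;
    return log( $N / $n )                          if $type == 1;
    return log( ($N - $n + 0.5) / ($n + 0.5) )     if $type == 2;
    return log( ($N + 0.5) / ($n + 0.5) )          if $type == 3;
    die "unknown idf_type: $type";
}

# 250_0000_0000 is the default 'documents' value listed above.
my $N = 250_0000_0000;

# The document counts below are made-up illustration values.
for my $type (1 .. 3) {
    printf "type %d: common word = %.3f, rare word = %.3f\n",
        $type,
        idf_weight($type, $N, 1_000_000_000),
        idf_weight($type, $N, 100);
}
```

For every type, the rare word receives the larger weight.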
api => 'Yahoo' || 'YahooPremium'
Uses the specified Web API when fetching WebDF (Web Document Frequency).
driver => 'Storable' || 'TokyoCabinet'
Fetches and saves WebDF with the specified driver.
df_file => $path
Saves WebDF to the specified path.
In order to reduce access to Web API, please download a big df file
from .
I recommend using a separate file for each Web API you specify,
because the WebDF values may differ between APIs.
fetch_df => 0
Never fetches WebDF from the Web if 0 is specified.
If the WebDF of the word has already been saved, the saved value is
used. Otherwise, undef is returned.
expires_in => $days
If 365 is specified, a WebDF value expires 365 days after it was fetched.
Furl_HTTP => \%option
Sets the options of Furl::HTTP->new.
If you want to use a proxy server, you have to set it with this option.
verbose => 1 || 0
If 1 is specified, shows verbose error messages.
idf($word)
Calculates the WebIDF weight of $word using the df($word) method.
df($word)
Fetches the WebDF of $word.
If the WebDF of $word has not been saved yet or has expired, fetches it
by using the Web API you specified and saves it.
If the WebDF of $word has expired and fetch_df is 0, the expired WebDF
is used.
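The lookup rules above can be sketched with a plain hash in place of the
database. This is an illustration of the described policy, not the
module's implementation: lookup_df, the cache layout (df plus a
fetched_at epoch time), and the fetch callback that stands in for the
Web API call are all hypothetical.

```perl
use strict;
use warnings;

use constant SECONDS_PER_DAY => 24 * 60 * 60;

# Hypothetical sketch of the lookup policy.
# cache     : hash ref, word => { df => ..., fetched_at => epoch seconds }
# fetch     : code ref standing in for the Web API call
# fetch_df  : 0 or 1, as in the configuration
# expires_in: lifetime of a saved value in days
sub lookup_df {
    my (%args) = @_;
    my ($cache, $word, $now) = @args{qw(cache word now)};

    my $entry   = $cache->{$word};
    my $expired = $entry
        && $now - $entry->{fetched_at} > $args{expires_in} * SECONDS_PER_DAY;

    # A saved value that has not expired is used as is.
    return $entry->{df} if $entry && !$expired;

    # Missing or expired: fetch and save, but only when fetching is allowed.
    if ($args{fetch_df}) {
        my $df = $args{fetch}->($word);
        $cache->{$word} = { df => $df, fetched_at => $now };
        return $df;
    }

    # fetch_df is 0: fall back to the expired value, or undef if none is saved.
    return $entry ? $entry->{df} : undef;
}
```

Passing the current time in as `now` keeps the sketch deterministic,
which makes the expiry rule easy to check by hand.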
db_open($mode)
Opens the database file located at $path (see df_file).
If you use TokyoCabinet, you have to open the database file with this
method before calling the idf, df, db_close, or purge method.
$mode is 'read' or 'write'.
db_close
Closes the database file which is located in $path.
This method is called automatically when the object is destroyed, so you
might not need to use this method explicitly.
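A typical open, query, close sequence might look like the following
sketch. idf_for_words is a hypothetical helper, and opening in 'read'
mode assumes that no new WebDF values need to be saved (fetch_df is 0);
which mode is required when fetching is enabled is not stated here. The
commented call assumes the module is installed and that ./df.tch exists.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper: $webidf is expected to be a Lingua::JA::WebIDF
# object, or any object that offers db_open, idf, and db_close.
sub idf_for_words {
    my ($webidf, @words) = @_;

    $webidf->db_open('read');       # must precede idf with TokyoCabinet

    my %idf;
    $idf{$_} = $webidf->idf($_) for @words;

    $webidf->db_close;              # explicit here; also done on destruction

    return \%idf;
}

# Possible call (not executed here):
#   my $webidf = Lingua::JA::WebIDF->new(
#       driver   => 'TokyoCabinet',
#       df_file  => './df.tch',
#       fetch_df => 0,
#   );
#   my $idf = idf_for_words($webidf, '東京', '大阪');
```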
purge($expires_in)
Purges old data in df_file.
If 365 is specified, data saved more than 365 days ago is purged.
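The purge rule amounts to a cutoff on the save time. A minimal sketch,
assuming an in-memory hash whose entries carry a hypothetical fetched_at
epoch time (purge_old and this layout are illustrations, not the
module's storage format):

```perl
use strict;
use warnings;

# Hypothetical sketch: remove entries saved more than $expires_in days
# before $now, and return how many were removed.
sub purge_old {
    my ($store, $expires_in, $now) = @_;

    my $limit = $now - $expires_in * 24 * 60 * 60;
    my @stale = grep { $store->{$_}{fetched_at} < $limit } keys %$store;

    delete @{$store}{@stale};

    return scalar @stale;
}
```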
AUTHOR
pawa
SEE ALSO
Lingua::JA::TFWebIDF
Lingua::JA::WebIDF::Driver::TokyoTyrant
Yahoo API:
Tokyo Cabinet:
S. Robertson, Understanding inverse document frequency: on theoretical
arguments for IDF. Journal of Documentation 60, 503-520, 2004.
LICENSE
This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.