Stemming based on n-grams with Zend Search Lucene?

Posted: Sat Apr 18, 2009 1:34 am
by sebastian_lutze
Hi people,

I am currently trying to implement stemming based on n-grams (http://www.clef-campaign.org/2008/worki ... MC2008.pdf) in "Zend Search Lucene". I wasn't very satisfied with the results of the Porter stemmer for queries, especially against multilingual content (in my case German and English).

Therefore I extended the class Zend_Search_Lucene_Analysis_Analyzer_Common and modified the method "nextToken()" a little:

Code:

    public function nextToken()
    {
        if ($this->_input === null) {
            $this->_tokenBuffer = array();
            return null;
        }

        if (sizeof($this->_tokenBuffer) > 0) {
            // n-grams are still buffered from the previous word: return the next one
            $token = array_shift($this->_tokenBuffer);
        } else {
            // buffer is empty: read the next word from the input and split it into n-grams
            do {
                // preg_match() returns 0 when there are no more matches and
                // FALSE when an error occurred; both cases end the token stream
                if (!preg_match(
                    '/[\p{L}\p{N}]+/u', $this->_input, $match, PREG_OFFSET_CAPTURE, $this->_bytePosition
                )) {
                    $this->_tokenBuffer = array();
                    return null;
                }

                // matched string
                $matchedWord = $match[0][0];

                // byte position of the matched word in the input stream
                $binStartPos = $match[0][1];

                // character position of the matched word in the input stream
                $startPos = $this->_position + iconv_strlen(
                    substr($this->_input, $this->_bytePosition, $binStartPos - $this->_bytePosition),
                    'UTF-8'
                );

                // character position of the end of the matched word in the input stream
                $endPos = $startPos + iconv_strlen($matchedWord, 'UTF-8');

                $this->_bytePosition = $binStartPos + strlen($matchedWord);
                $this->_position     = $endPos;

                $token = $this->normalize(
                    new Zend_Search_Lucene_Analysis_Token($matchedWord, $startPos, $endPos)
                );

                if (!is_null($token)) {
                    // fill the token buffer with n-grams; all generated n-grams share
                    // the same position (I don't know if that is the right way)
                    $this->_fillTokenBuffer($token, $startPos, $endPos);
                    // return the first one
                    $token = array_shift($this->_tokenBuffer);
                }
            } while ($token === null); // try again if the token was skipped
        }

        return $token;
    }
Furthermore the method to generate n-grams from words:

Code:

    protected function _fillTokenBuffer(Zend_Search_Lucene_Analysis_Token $token, $startPos, $endPos)
    {
        $matchedWord = $token->getTermText();
        //echo "matchedWord: $matchedWord<br>";

        // short words and numbers are not split into n-grams
        if (iconv_strlen($matchedWord, 'UTF-8') <= $this->_minWordSize || is_numeric($matchedWord)) {
            $this->_tokenBuffer[] = new Zend_Search_Lucene_Analysis_Token($matchedWord, $startPos, $endPos);
        } else {
            // pad the word with "_" so that n-grams at the word boundaries are marked,
            // then fill the token buffer with one token per n-gram
            $matchedWord = '_' . $matchedWord . '_';
            $length = iconv_strlen($matchedWord, 'UTF-8');
            for ($pos = 0; $pos < $length; $pos++) {
                for ($chars = 0; $chars < $this->_maxNGramSize; $chars++) {
                    if (($pos + $chars) < $length) {
                        $nGram = mb_substr($matchedWord, $pos, $chars + 1, 'UTF-8');
                        if (iconv_strlen($nGram, 'UTF-8') >= $this->_minNGramSize) {
                            $this->_tokenBuffer[] = new Zend_Search_Lucene_Analysis_Token(
                                $nGram, $startPos, $endPos
                            );
                            //echo "n-gram: $nGram<br>";
                        }
                    }
                }
            }
        }
    }
It seems to work very well. For example, the string "statistical science" is tokenized into:

_stat
stati
tatis
atist
tisti
istic
stica
tical
ical_
_scie
scien
cienc
ience
ence_
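
The list above can be reproduced outside of the analyzer with a small standalone sketch. The function name characterNGrams() is hypothetical and not part of Zend Search Lucene, and the sizes 5 and 5 are an assumption read off the listed n-grams, since the post does not state the configured minimum and maximum:

```php
<?php
// Hypothetical standalone helper, not part of Zend Search Lucene.
// Returns all character n-grams of $word with sizes $minSize..$maxSize.
function characterNGrams($word, $minSize, $maxSize)
{
    // Pad with "_" so that n-grams at the start and end of the word are marked.
    $padded = '_' . $word . '_';
    $length = mb_strlen($padded, 'UTF-8');
    $nGrams = array();

    for ($pos = 0; $pos < $length; $pos++) {
        for ($size = $minSize; $size <= $maxSize; $size++) {
            // Skip n-grams that would run past the end of the padded word.
            if ($pos + $size <= $length) {
                $nGrams[] = mb_substr($padded, $pos, $size, 'UTF-8');
            }
        }
    }

    return $nGrams;
}

// Assumed sizes: minimum 5, maximum 5.
echo implode("\n", characterNGrams('statistical', 5, 5)), "\n";
echo implode("\n", characterNGrams('science', 5, 5)), "\n";
```

Like the method above, this relies on the mbstring extension for multibyte-safe substrings.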

The first experiments look very promising. For example, a search for "statistical science" also found documents containing "scientifical statistic".

Now my problem:

The result of

Code:

Zend_Search_Lucene_Search_QueryParser::parse("statistical science")
is:

(+_stat +stati +tatis +atist +tisti +istic +stica +tical +ical_) (+_scie +scien +cienc +ience +ence_)

Basically that is fine, because it results in an exact match. But with this behavior the application gains nothing from the larger number of index terms produced by tokenizing into n-grams. Therefore the query string should be processed into:

(_stat stati tatis atist tisti istic stica tical ical_) (_scie scien cienc ience ence_)
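
Why the signs matter can be shown with a rough standalone model. This is only a sketch: it models "required" as "every query n-gram must occur" and "optional" as "at least one query n-gram must occur", and the n-grams for a document containing just the word "statistic" are worked out by hand from "_statistic_" under the assumption of 5-grams.

```php
<?php
// 5-grams of the query word "statistical", copied from the list in the post.
$queryNGrams = array('_stat', 'stati', 'tatis', 'atist', 'tisti', 'istic', 'stica', 'tical', 'ical_');

// 5-grams of "_statistic_", i.e. what a document containing only "statistic" would index.
$documentNGrams = array('_stat', 'stati', 'tatis', 'atist', 'tisti', 'istic', 'stic_');

// Query n-grams that also occur in the document.
$shared = array_values(array_intersect($queryNGrams, $documentNGrams));

// Required terms ("+"): every query n-gram has to be present in the document.
$matchesAllRequired = count($shared) === count($queryNGrams);

// Optional terms (no sign), modelled here as: one shared n-gram is enough.
$matchesAnyOptional = count($shared) > 0;

printf("shared n-grams: %d of %d\n", count($shared), count($queryNGrams));
var_dump($matchesAllRequired, $matchesAnyOptional);
```

In this model the document fails the all-required query because "stica", "tical" and "ical_" are missing, while the optional form still matches on the six shared n-grams.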

Even better would be an additional operator to control the behavior.

Code:

Zend_Search_Lucene_Search_QueryParser::parse("statistical science#")
should be processed into:

(+_stat +stati +tatis +atist +tisti +istic +stica +tical +ical_) (_scie scien cienc ience ence_)
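
The intended expansion can be sketched as a plain string transformation. Both function names below are hypothetical, the n-gram size of 5 is an assumption taken from the example, and the result is only the target notation shown above, not something to feed back into the query parser (the analyzer would tokenize it again):

```php
<?php
// Hypothetical helper: 5-grams of a word padded with "_", as in the example above.
function fiveGrams($word)
{
    $padded = '_' . $word . '_';
    $length = mb_strlen($padded, 'UTF-8');
    $nGrams = array();
    for ($pos = 0; $pos + 5 <= $length; $pos++) {
        $nGrams[] = mb_substr($padded, $pos, 5, 'UTF-8');
    }
    return $nGrams;
}

// Hypothetical pre-processor: a trailing "#" marks a word whose n-grams are optional,
// every other word gets required ("+") n-grams.
function expandQuery($queryString)
{
    $groups = array();
    foreach (preg_split('/\s+/', trim($queryString), -1, PREG_SPLIT_NO_EMPTY) as $word) {
        $optional = substr($word, -1) === '#';
        $bareWord = rtrim($word, '#');
        $prefix   = $optional ? '' : '+';

        $nGrams = fiveGrams($bareWord);
        if (count($nGrams) === 0) {
            // Words too short for a 5-gram are kept as they are.
            $nGrams = array($bareWord);
        }

        $groups[] = '(' . $prefix . implode(' ' . $prefix, $nGrams) . ')';
    }
    return implode(' ', $groups);
}

echo expandQuery('statistical science#'), "\n";
```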


Any ideas how to extend or modify "Zend Search Lucene" to solve this problem?

Or any thoughts, ideas or information concerning "stemming based on n-grams with Zend Search Lucene"?

To be honest: I really don't know whether this whole "n-gram thing" is feasible with Zend Lucene or even complete xxxx. ;)


Many thanks & best regards,

Sebastian from Leipzig, Germany

:)


PS: sorry for my bad English.

Re: Stemming based on n-grams with Zend Search Lucene?

Posted: Sat Apr 18, 2009 2:56 pm
by sebastian_lutze
Hi people,

after a sleepless night, a lot of coffee, blood and sweat and repeated "banging my head against a wall", I discovered some interesting lines of code in the class "Zend_Search_Lucene_Search_Query_MultiTerm":

Code:

    public function optimize(Zend_Search_Lucene_Interface $index)
    {
        $terms = $this->_terms;
        $signs = $this->_signs;

        foreach ($terms as $id => $term) {
            if (!$index->hasTerm($term)) {
                if ($signs === null  ||  $signs[$id] === true) {
                    // Term is required
                    return new Zend_Search_Lucene_Search_Query_Empty();
                } else {
                    // Term is optional or prohibited
                    // Remove it from terms and signs list
                    unset($terms[$id]);
                    unset($signs[$id]);
                }
            }
        }
        .....
Especially the lines:

Code:

                if ($signs === null  ||  $signs[$id] === true) {
                    // Term is required
                    return new Zend_Search_Lucene_Search_Query_Empty();
                } else {
                ....
"Term is required"(???) if

$signs === null

I thought a query term without a sign was always optional?
Furthermore the result of

Code:

var_dump($signs[$id]);
is null, too.

I changed the code to:

Code:

                if ($signs[$id] === true) {
                    // Term is required
                    return new Zend_Search_Lucene_Search_Query_Empty();
                }
... and everything works fine.

But now I am wondering whether this modification could have some "unexpected effects"?
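
One possible effect can be read off the quoted condition alone. The sketch below models only the two conditions, with hypothetical function names, and has not been checked against the rest of the library: the original treats a sign list that is null as a whole as "every term is required", while the modified check never does.

```php
<?php
// Hypothetical model of the condition quoted from optimize(); not the library code.
function isRequiredOriginal($signs, $id)
{
    // A sign list that is null as a whole counts as "every term is required".
    return $signs === null || (isset($signs[$id]) && $signs[$id] === true);
}

// Hypothetical model of the modified condition.
function isRequiredModified($signs, $id)
{
    // Only an explicit true counts; with a null sign list no term is required.
    return is_array($signs) && isset($signs[$id]) && $signs[$id] === true;
}

$cases = array(
    'no sign list'        => null,
    'explicit required'   => array(true),
    'explicit optional'   => array(null),
    'explicit prohibited' => array(false),
);

foreach ($cases as $label => $signs) {
    printf(
        "%-20s original: %-5s modified: %s\n",
        $label,
        var_export(isRequiredOriginal($signs, 0), true),
        var_export(isRequiredModified($signs, 0), true)
    );
}
```

The two versions differ only in the "no sign list" case, so a query built without any signs would, after the change, no longer become empty when one of its terms is missing from the index.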


Many thanks & best regards,

Sebastian from Leipzig, Germany

:)


PS: sorry for my bad English.

Re: Stemming based on n-grams with Zend Search Lucene?

Posted: Mon Apr 20, 2009 11:20 pm
by sebastian_lutze
Hi people,

after another sleepless night, a lot of coffee, nightmares, etc., I discovered the REAL reason.
The cause of my problem is in the class "Zend_Search_Lucene_Search_QueryEntry_Term":

Code:

    
    /**
     * Transform entry to a subquery
     *
     * @param string $encoding
     * @return Zend_Search_Lucene_Search_Query
     * @throws Zend_Search_Lucene_Search_QueryParserException
     */
    public function getQuery($encoding)
    {
    ...

        //It's not empty or one term query
        $query = new Zend_Search_Lucene_Search_Query_MultiTerm();

        /**
         * @todo Process $token->getPositionIncrement() to support stemming, synonyms and other
         * analizer design features
         */
        foreach ($tokens as $token) {
            $term = new Zend_Search_Lucene_Index_Term($token->getTermText(), $this->_field);
            $query->addTerm($term, true); // all subterms are required 
        }

        $query->setBoost($this->_boost);

        return $query;
    }
By default, the instance of "Zend_Search_Lucene_Search_Query_MultiTerm" is filled with required terms ($sign === true).
I added a static member and a static setter so that the sign used for multi-term queries can be set before each parse:

Code:

    private static $_multiTermSign = true;

    public static function setMultiTermSign($sign)
    {
        self::$_multiTermSign = $sign;
    }
... and modified the method "getQuery()":

Code:

        //It's not empty or one term query
        $query = new Zend_Search_Lucene_Search_Query_MultiTerm();

        /**
         * @todo Process $token->getPositionIncrement() to support stemming, synonyms and other
         * analizer design features
         */
        foreach ($tokens as $token) {
            $term = new Zend_Search_Lucene_Index_Term($token->getTermText(), $this->_field);
            //$query->addTerm($term, true); // all subterms are required
            $query->addTerm($term, self::$_multiTermSign); // all subterms are required or not, it's a matter of my mood
        }
Now it is possible to use ...

Code:

Zend_Search_Lucene_Search_QueryEntry_Term::setMultiTermSign(null);
... to control the behavior.


best regards,

Sebastian from Leipzig, Germany