Discuz! X2: Notes on Adding Sphinx Full-Text Search Support

Sat, 28 Apr 2012 08:35:26 +0000

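Sphinx is a very good full-text search engine that supports MySQL and PostgreSQL.

Stock Sphinx handles English full-text search well, but for Chinese full-text search you should use coreseek, a modified version maintained by Chinese developers.

Installing coreseek is a bit more involved. Here are my installation notes:

First, download the coreseek source: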

wget http://www.coreseek.cn/uploads/csft/3.2/coreseek-3.2.14.tar.gz

Extract it:

tar xzvf coreseek-3.2.14.tar.gz
cd coreseek-3.2.14

The coreseek directory contains three subdirectories: mmseg-3.2.14, csft-3.2.14 and testpack. mmseg has to be installed first, then csft.

Install mmseg, the Chinese word segmentation library:

cd mmseg-3.2.14
aclocal
libtoolize --force
automake --add-missing
autoconf
autoheader
make clean # errors at this point can be ignored
./configure --prefix=/usr/local/mmseg3
make && make install
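
The autotools steps above stop with "command not found" when one of the build tools is missing. A minimal pre-flight sketch follows; `missing_tools` is a hypothetical helper name, and the tool list is simply the commands used in the steps above:

```shell
#!/bin/sh
# Print the name of every command in the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Usage (run before the build steps):
#   missing=$(missing_tools aclocal libtoolize automake autoconf autoheader make)
#   [ -z "$missing" ] || echo "install these first: $missing"
```

If the list is non-empty, install the missing packages before running aclocal.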

Next, a small tweak: rebuild the dictionary from unigram.txt and install it as uni.lib:

cd data
ln -s /usr/local/mmseg3/bin/mmseg /usr/bin/mmseg
mmseg -u unigram.txt
cp unigram.txt.uni /usr/local/mmseg3/etc/uni.lib
cd ..

Go back up to the parent directory:
cd ..

Install csft, which is the main coreseek program:

cd csft-3.2.14
sh buildconf.sh
./configure --prefix=/usr/local/coreseek --without-python \
--without-unixodbc --with-mmseg --with-mmseg-includes=/usr/local/mmseg3/include/mmseg/ \
--with-mmseg-libs=/usr/local/mmseg3/lib/ --with-mysql
make && make install
cd ..

coreseek is now installed.

Run a quick test to check that coreseek works correctly:

cd testpack
/usr/local/coreseek/bin/indexer -c etc/csft.conf
## Output when everything is working:
Coreseek Fulltext 3.2 [ Sphinx 0.9.9-release (r2117)]
Copyright (c) 2007-2010,
Beijing Choice Software Technologies Inc (http://www.coreseek.com)

using config file 'etc/csft.conf'...
total 0 reads, 0.000 sec, 0.0 kb/call avg, 0.0 msec/call avg
total 0 writes, 0.000 sec, 0.0 kb/call avg, 0.0 msec/call avg

/usr/local/coreseek/bin/indexer -c etc/csft.conf --all
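
To make the test less dependent on reading the output by eye, the indexer output can be scanned for problem lines. This is only a sketch: `indexer_problems` is a hypothetical helper, and it assumes that indexer reports problems on lines beginning with `ERROR:` or `FATAL:`.

```shell
#!/bin/sh
# Read indexer output on stdin, print any line that starts with
# ERROR: or FATAL:, and return non-zero if at least one was found.
indexer_problems() {
    problems=$(grep -E '^(ERROR|FATAL):' || true)
    if [ -n "$problems" ]; then
        printf '%s\n' "$problems"
        return 1
    fi
    return 0
}

# Usage:
#   /usr/local/coreseek/bin/indexer -c etc/csft.conf --all 2>&1 | indexer_problems \
#       && echo "indexing looks clean"
```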

Next, edit the configuration to support Discuz! X2:

vi /usr/local/coreseek/etc/sphinx.conf

#
# Sphinx configuration file sample
#
# WARNING! While this sample file mentions all available options,
# it contains (very) short helper descriptions only. Please refer to
# doc/sphinx.html for details.
#

#############################################################################
## data source definition
#############################################################################

source threads
{
# data source type. mandatory, no default value
# known types are mysql, pgsql, mssql, xmlpipe, xmlpipe2, odbc
type                    = mysql

#####################################################################
## SQL settings (for 'mysql' and 'pgsql' types)
#####################################################################

# some straightforward parameters for SQL source types
sql_host                = localhost
sql_user                = XXX
sql_pass                = XXX
sql_db                    = XXX
sql_port                = 3306    # optional, default is 3306

# UNIX socket name
# optional, default is empty (reuse client library defaults)
# usually '/var/lib/mysql/mysql.sock' on Linux
# usually '/tmp/mysql.sock' on FreeBSD
#
# sql_sock                = /tmp/mysql.sock

# MySQL specific client connection flags
# optional, default is 0
#
# mysql_connect_flags    = 32 # enable compression

# MySQL specific SSL certificate settings
# optional, defaults are empty
#
# mysql_ssl_cert        = /etc/ssl/client-cert.pem
# mysql_ssl_key        = /etc/ssl/client-key.pem
# mysql_ssl_ca        = /etc/ssl/cacert.pem

# MS SQL specific Windows authentication mode flag
# MUST be in sync with charset_type index-level setting
# optional, default is 0
#
# mssql_winauth            = 1 # use currently logged on user credentials

# MS SQL specific Unicode indexing flag
# optional, default is 0 (request SBCS data)
#
# mssql_unicode            = 1 # request Unicode data from server

# ODBC specific DSN (data source name)
# mandatory for odbc source type, no default value
#
# odbc_dsn                = DBQ=C:\data;DefaultDir=C:\data;Driver={Microsoft Text Driver (*.txt; *.csv)};
# sql_query                = SELECT id, data FROM documents.csv

# pre-query, executed before the main fetch query
# multi-value, optional, default is empty list of queries
#
sql_query_pre            = SET NAMES utf8
sql_query_pre            = SET SESSION query_cache_type=OFF

#timy
sql_query_pre           = CREATE TABLE IF NOT EXISTS sph_counter ( counter_id INTEGER PRIMARY KEY NOT NULL,max_doc_id INTEGER NOT NULL)
sql_query_pre            = REPLACE INTO sph_counter SELECT 1, MAX(tid)-100 FROM pre_forum_thread
#timy

# main document fetch query
# mandatory, integer document ID field MUST be the first selected column
sql_query                = \
SELECT t.tid AS id,t.tid,t.subject,t.digest,t.displayorder,t.authorid,t.lastpost,t.special \
FROM pre_forum_thread AS t \
WHERE t.tid>=$start AND t.tid<=$end

# range query setup, query that must return min and max ID values
# optional, default is empty
#
# sql_query will need to reference $start and $end boundaries
# if using ranged query:
#
# sql_query                = \
#    SELECT doc.id, doc.id AS group, doc.title, doc.data \
#    FROM documents doc \
#    WHERE id>=$start AND id<=$end
#
sql_query_range        = SELECT (SELECT MIN(tid) FROM pre_forum_thread),max_doc_id FROM sph_counter WHERE counter_id=1

# range query step
# optional, default is 1024
#
# sql_range_step        = 1000

# unsigned integer attribute declaration
# multi-value (an arbitrary number of attributes is allowed), optional
# optional bit size can be specified, default is 32
#
# sql_attr_uint            = author_id
# sql_attr_uint            = forum_id:9 # 9 bits for forum_id
sql_attr_uint            = tid
sql_attr_uint            = digest
sql_attr_uint            = displayorder
sql_attr_uint            = authorid
sql_attr_uint            = special

# boolean attribute declaration
# multi-value (an arbitrary number of attributes is allowed), optional
# equivalent to sql_attr_uint with 1-bit size
#
# sql_attr_bool            = is_deleted

# bigint attribute declaration
# multi-value (an arbitrary number of attributes is allowed), optional
# declares a signed (unlike uint!) 64-bit attribute
#
# sql_attr_bigint            = my_bigint_id

# UNIX timestamp attribute declaration
# multi-value (an arbitrary number of attributes is allowed), optional
# similar to integer, but can also be used in date functions
#
# sql_attr_timestamp    = posted_ts
# sql_attr_timestamp    = last_edited_ts
sql_attr_timestamp        = lastpost

# string ordinal attribute declaration
# multi-value (an arbitrary number of attributes is allowed), optional
# sorts strings (bytewise), and stores their indexes in the sorted list
# sorting by this attr is equivalent to sorting by the original strings
#
# sql_attr_str2ordinal    = author_name

# floating point attribute declaration
# multi-value (an arbitrary number of attributes is allowed), optional
# values are stored in single precision, 32-bit IEEE 754 format
#
# sql_attr_float = lat_radians
# sql_attr_float = long_radians

# multi-valued attribute (MVA) attribute declaration
# multi-value (an arbitrary number of attributes is allowed), optional
# MVA values are variable length lists of unsigned 32-bit integers
#
# syntax is ATTR-TYPE ATTR-NAME 'from' SOURCE-TYPE [;QUERY] [;RANGE-QUERY]
# ATTR-TYPE is 'uint' or 'timestamp'
# SOURCE-TYPE is 'field', 'query', or 'ranged-query'
# QUERY is SQL query used to fetch all ( docid, attrvalue ) pairs
# RANGE-QUERY is SQL query used to fetch min and max ID values, similar to 'sql_query_range'
#
# sql_attr_multi    = uint tag from query; SELECT id, tag FROM tags
# sql_attr_multi    = uint tag from ranged-query; \
#    SELECT id, tag FROM tags WHERE id>=$start AND id<=$end; \
#    SELECT MIN(id), MAX(id) FROM tags

# post-query, executed on sql_query completion
# optional, default is empty
#
# sql_query_post        =

# post-index-query, executed on successful indexing completion
# optional, default is empty
# $maxid expands to max document ID actually fetched from DB
#
# sql_query_post_index = REPLACE INTO counters ( id, val ) \
#    VALUES ( 'max_indexed_id', $maxid )

# ranged query throttling, in milliseconds
# optional, default is 0 which means no delay
# enforces given delay before each query step
sql_ranged_throttle    = 0

# document info query, ONLY for CLI search (ie. testing and debugging)
# optional, default is empty
# must contain $id macro and must fetch the document by that id
sql_query_info        = SELECT * FROM pre_forum_thread WHERE tid=$id

# kill-list query, fetches the document IDs for kill-list
# k-list will suppress matches from preceding indexes in the same query
# optional, default is empty
#
# sql_query_killlist    = SELECT id FROM documents WHERE edited>=@last_reindex

# columns to unpack on indexer side when indexing
# multi-value, optional, default is empty list
#
# unpack_zlib = zlib_column
# unpack_mysqlcompress = compressed_column
# unpack_mysqlcompress = compressed_column_2

# maximum unpacked length allowed in MySQL COMPRESS() unpacker
# optional, default is 16M
#
# unpack_mysqlcompress_maxsize = 16M

#####################################################################
## xmlpipe settings
#####################################################################

# type                = xmlpipe

# shell command to invoke xmlpipe stream producer
# mandatory
#
# xmlpipe_command    = cat /usr/local/coreseek/var/test.xml

#####################################################################
## xmlpipe2 settings
#####################################################################

# type                = xmlpipe2
# xmlpipe_command    = cat /usr/local/coreseek/var/test2.xml

# xmlpipe2 field declaration
# multi-value, optional, default is empty
#
# xmlpipe_field                = subject
# xmlpipe_field                = content

# xmlpipe2 attribute declaration
# multi-value, optional, default is empty
# all xmlpipe_attr_XXX options are fully similar to sql_attr_XXX
#
# xmlpipe_attr_timestamp    = published
# xmlpipe_attr_uint            = author_id

# perform UTF-8 validation, and filter out incorrect codes
# avoids XML parser choking on non-UTF-8 documents
# optional, default is 0
#
# xmlpipe_fixup_utf8        = 1
}

#############################################################################
## index definition
#############################################################################

# local index example
#
# this is an index which is stored locally in the filesystem
#
# all indexing-time options (such as morphology and charsets)
# are configured per local index
index threads
{
# document source(s) to index
# multi-value, mandatory
# document IDs must be globally unique across all sources
source            = threads

# index files path and file name, without extension
# mandatory, path must be writable, extensions will be auto-appended
path            = /usr/local/coreseek/var/data/threads

# document attribute values (docinfo) storage mode
# optional, default is 'extern'
# known values are 'none', 'extern' and 'inline'
docinfo            = extern
#charset_dictpath = /etc

# memory locking for cached data (.spa and .spi), to prevent swapping
# optional, default is 0 (do not mlock)
# requires searchd to be run from root
mlock            = 0

# a list of morphology preprocessors to apply
# optional, default is empty
#
# builtin preprocessors are 'none', 'stem_en', 'stem_ru', 'stem_enru',
# 'soundex', and 'metaphone'; additional preprocessors available from
# libstemmer are 'libstemmer_XXX', where XXX is algorithm code
# (see libstemmer_c/libstemmer/modules.txt)
#
# morphology     = stem_en, stem_ru, soundex
# morphology    = libstemmer_german
# morphology    = libstemmer_sv
morphology        = none

# minimum word length at which to enable stemming
# optional, default is 1 (stem everything)
#
# min_stemming_len    = 1

# stopword files list (space separated)
# optional, default is empty
# contents are plain text, charset_table and stemming are both applied
#
# stopwords            = /usr/local/coreseek/var/data/stopwords.txt

# wordforms file, in "mapfrom > mapto" plain text format
# optional, default is empty
#
# wordforms            = /usr/local/sphinx/var/data/wordforms.txt

# tokenizing exceptions file
# optional, default is empty
#
# plain text, case sensitive, space insensitive in map-from part
# one "Map Several Words => ToASingleOne" entry per line
#
# exceptions        = /usr/local/sphinx/var/data/exceptions.txt

# minimum indexed word length
# default is 1 (index everything)
min_word_len        = 1

# charset encoding type
# optional, default is 'sbcs'
# known types are 'sbcs' (Single Byte CharSet) and 'utf-8'
charset_type        = utf-8
charset_dictpath = /usr/local/mmseg3/etc/

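##### Character table. Note: with this approach Sphinx splits Chinese text into single characters,
##### i.e. it builds a per-character index. For real Chinese word segmentation you must use a segmentation plugin such as coreseek or sfc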
charset_table = U+FF10..U+FF19->0..9, 0..9, U+FF41..U+FF5A->a..z, U+FF21..U+FF3A->a..z,\
A..Z->a..z, a..z, U+0149, U+017F, U+0138, U+00DF, U+00FF, U+00C0..U+00D6->U+00E0..U+00F6,\
U+00E0..U+00F6, U+00D8..U+00DE->U+00F8..U+00FE, U+00F8..U+00FE, U+0100->U+0101, U+0101,\
U+0102->U+0103, U+0103, U+0104->U+0105, U+0105, U+0106->U+0107, U+0107, U+0108->U+0109,\
U+0109, U+010A->U+010B, U+010B, U+010C->U+010D, U+010D, U+010E->U+010F, U+010F,\
U+0110->U+0111, U+0111, U+0112->U+0113, U+0113, U+0114->U+0115, U+0115, \
U+0116->U+0117,U+0117, U+0118->U+0119, U+0119, U+011A->U+011B, U+011B, U+011C->U+011D,\
U+011D,U+011E->U+011F, U+011F, U+0130->U+0131, U+0131, U+0132->U+0133, U+0133, \
U+0134->U+0135,U+0135, U+0136->U+0137, U+0137, U+0139->U+013A, U+013A, U+013B->U+013C, \
U+013C,U+013D->U+013E, U+013E, U+013F->U+0140, U+0140, U+0141->U+0142, U+0142, \
U+0143->U+0144,U+0144, U+0145->U+0146, U+0146, U+0147->U+0148, U+0148, U+014A->U+014B, \
U+014B,U+014C->U+014D, U+014D, U+014E->U+014F, U+014F, U+0150->U+0151, U+0151, \
U+0152->U+0153,U+0153, U+0154->U+0155, U+0155, U+0156->U+0157, U+0157, U+0158->U+0159,\
U+0159,U+015A->U+015B, U+015B, U+015C->U+015D, U+015D, U+015E->U+015F, U+015F, \
U+0160->U+0161,U+0161, U+0162->U+0163, U+0163, U+0164->U+0165, U+0165, U+0166->U+0167, \
U+0167,U+0168->U+0169, U+0169, U+016A->U+016B, U+016B, U+016C->U+016D, U+016D, \
U+016E->U+016F,U+016F, U+0170->U+0171, U+0171, U+0172->U+0173, U+0173, U+0174->U+0175,\
U+0175,U+0176->U+0177, U+0177, U+0178->U+00FF, U+00FF, U+0179->U+017A, U+017A, \
U+017B->U+017C,U+017C, U+017D->U+017E, U+017E, U+0410..U+042F->U+0430..U+044F, \
U+0430..U+044F,U+05D0..U+05EA, U+0531..U+0556->U+0561..U+0586, U+0561..U+0587, \
U+0621..U+063A, U+01B9,U+01BF, U+0640..U+064A, U+0660..U+0669, U+066E, U+066F, \
U+0671..U+06D3, U+06F0..U+06FF,U+0904..U+0939, U+0958..U+095F, U+0960..U+0963, \
U+0966..U+096F, U+097B..U+097F,U+0985..U+09B9, U+09CE, U+09DC..U+09E3, U+09E6..U+09EF, \
U+0A05..U+0A39, U+0A59..U+0A5E,U+0A66..U+0A6F, U+0A85..U+0AB9, U+0AE0..U+0AE3, \
U+0AE6..U+0AEF, U+0B05..U+0B39,U+0B5C..U+0B61, U+0B66..U+0B6F, U+0B71, U+0B85..U+0BB9, \
U+0BE6..U+0BF2, U+0C05..U+0C39,U+0C66..U+0C6F, U+0C85..U+0CB9, U+0CDE..U+0CE3, \
U+0CE6..U+0CEF, U+0D05..U+0D39, U+0D60,U+0D61, U+0D66..U+0D6F, U+0D85..U+0DC6, \
U+1900..U+1938, U+1946..U+194F, U+A800..U+A805,U+A807..U+A822, U+0386->U+03B1, \
U+03AC->U+03B1, U+0388->U+03B5, U+03AD->U+03B5,U+0389->U+03B7, U+03AE->U+03B7, \
U+038A->U+03B9, U+0390->U+03B9, U+03AA->U+03B9,U+03AF->U+03B9, U+03CA->U+03B9, \
U+038C->U+03BF, U+03CC->U+03BF, U+038E->U+03C5,U+03AB->U+03C5, U+03B0->U+03C5, \
U+03CB->U+03C5, U+03CD->U+03C5, U+038F->U+03C9,U+03CE->U+03C9, U+03C2->U+03C3, \
U+0391..U+03A1->U+03B1..U+03C1,U+03A3..U+03A9->U+03C3..U+03C9, U+03B1..U+03C1, \
U+03C3..U+03C9, U+0E01..U+0E2E,U+0E30..U+0E3A, U+0E40..U+0E45, U+0E47, U+0E50..U+0E59, \
U+A000..U+A48F, U+4E00..U+9FBF,U+3400..U+4DBF, U+20000..U+2A6DF, U+F900..U+FAFF, \
U+2F800..U+2FA1F, U+2E80..U+2EFF,U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF, \
U+3040..U+309F, U+30A0..U+30FF,U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF, \
U+3130..U+318F, U+A000..U+A48F,U+A490..U+A4CF
min_prefix_len = 0
min_infix_len = 1
ngram_len = 1

# charset definition and case folding rules "table"
# optional, default value depends on charset_type
#
# defaults are configured to include English and Russian characters only
# you need to change the table to include additional ones
# this behavior MAY change in future versions
#
# 'sbcs' default value is
# charset_table        = 0..9, A..Z->a..z, _, a..z, U+A8->U+B8, U+B8, U+C0..U+DF->U+E0..U+FF, U+E0..U+FF
#
# 'utf-8' default value is
# charset_table        = 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F

# ignored characters list
# optional, default value is empty
#
# ignore_chars        = U+00AD

# minimum word prefix length to index
# optional, default is 0 (do not index prefixes)
#
# min_prefix_len    = 0

# minimum word infix length to index
# optional, default is 0 (do not index infixes)
#
# min_infix_len        = 0

# list of fields to limit prefix/infix indexing to
# optional, default value is empty (index all fields in prefix/infix mode)
#
# prefix_fields        = filename
# infix_fields        = url, domain

# enable star-syntax (wildcards) when searching prefix/infix indexes
# known values are 0 and 1
# optional, default is 0 (do not use wildcard syntax)
#
# enable_star        = 1

# n-gram length to index, for CJK indexing
# only supports 0 and 1 for now, other lengths to be implemented
# optional, default is 0 (disable n-grams)
#
# ngram_len                = 1

# n-gram characters list, for CJK indexing
# optional, default is empty
#
# ngram_chars            = U+3000..U+2FA1F

# phrase boundary characters list
# optional, default is empty
#
# phrase_boundary        = ., ?, !, U+2026 # horizontal ellipsis

# phrase boundary word position increment
# optional, default is 0
#
# phrase_boundary_step    = 100

# whether to strip HTML tags from incoming documents
# known values are 0 (do not strip) and 1 (do strip)
# optional, default is 0
html_strip                = 0

# what HTML attributes to index if stripping HTML
# optional, default is empty (do not index anything)
#
# html_index_attrs        = img=alt,title; a=title;

# what HTML elements contents to strip
# optional, default is empty (do not strip element contents)
#
# html_remove_elements    = style, script

# whether to preopen index data files on startup
# optional, default is 0 (do not preopen), searchd-only
#
# preopen                    = 1

# whether to keep dictionary (.spi) on disk, or cache it in RAM
# optional, default is 0 (cache in RAM), searchd-only
#
# ondisk_dict                = 1

# whether to enable in-place inversion (2x less disk, 90-95% speed)
# optional, default is 0 (use separate temporary files), indexer-only
#
# inplace_enable            = 1

# in-place fine-tuning options
# optional, defaults are listed below
#
# inplace_hit_gap            = 0        # preallocated hitlist gap size
# inplace_docinfo_gap        = 0        # preallocated docinfo gap size
# inplace_reloc_factor    = 0.1    # relocation buffer size within arena
# inplace_write_factor    = 0.1    # write buffer size within arena

# whether to index original keywords along with stemmed versions
# enables "=exactform" operator to work
# optional, default is 0
#
# index_exact_words        = 1

# position increment on overshort (less than min_word_len) words
# optional, allowed values are 0 and 1, default is 1
#
# overshort_step            = 1

# position increment on stopword
# optional, allowed values are 0 and 1, default is 1
#
# stopword_step            = 1
}

#threads_minute
source threads_minute : threads
{
sql_query_pre            =
sql_query_pre            = SET NAMES UTF8
sql_query_pre           = SET SESSION query_cache_type=OFF

sql_query_range            = SELECT max_doc_id+1,(SELECT MAX(tid) FROM pre_forum_thread) FROM sph_counter WHERE counter_id=1
}

#threads_minute
index threads_minute : threads
{
source            = threads_minute
path            = /usr/local/coreseek/var/data/threads_minute
}

#posts
source posts : threads
{
type                    = mysql
sql_query_pre            =
sql_query_pre            = SET NAMES UTF8
sql_query_pre           = SET SESSION query_cache_type=OFF
sql_query_pre           = CREATE TABLE IF NOT EXISTS sph_counter ( counter_id INTEGER PRIMARY KEY NOT NULL,max_doc_id INTEGER NOT NULL)
sql_query_pre            = REPLACE INTO sph_counter SELECT 2, MAX(pid)-5000 FROM pre_forum_post

sql_query                = SELECT p.pid AS id,p.tid,p.subject,p.message,t.digest,t.displayorder,t.authorid,t.lastpost,t.special \
FROM pre_forum_post AS p LEFT JOIN pre_forum_thread AS t USING(tid) \
WHERE p.pid>=$start AND p.pid<=$end

sql_query_range            = SELECT (SELECT MIN(pid) FROM pre_forum_post),max_doc_id FROM sph_counter WHERE counter_id=2
sql_range_step          = 4096

sql_attr_uint            = tid
sql_attr_uint            = digest
sql_attr_uint            = displayorder
sql_attr_uint            = authorid
sql_attr_uint            = special

sql_attr_timestamp        = lastpost

sql_query_info            = SELECT * FROM pre_forum_post WHERE pid=$id
}

#posts
index posts
{
source            = posts
path            = /usr/local/coreseek/var/data/posts
docinfo            = extern
mlock            = 0
morphology        = none
charset_dictpath= /usr/local/coreseek/etc/
charset_debug   =   0
#### minimum indexed word length
min_word_len = 1
charset_type = utf-8
html_strip = 0

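##### Character table. Note: with this approach Sphinx splits Chinese text into single characters,
##### i.e. it builds a per-character index. For real Chinese word segmentation you must use a segmentation plugin such as coreseek or sfc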
charset_table = U+FF10..U+FF19->0..9, 0..9, U+FF41..U+FF5A->a..z, U+FF21..U+FF3A->a..z,\
A..Z->a..z, a..z, U+0149, U+017F, U+0138, U+00DF, U+00FF, U+00C0..U+00D6->U+00E0..U+00F6,\
U+00E0..U+00F6, U+00D8..U+00DE->U+00F8..U+00FE, U+00F8..U+00FE, U+0100->U+0101, U+0101,\
U+0102->U+0103, U+0103, U+0104->U+0105, U+0105, U+0106->U+0107, U+0107, U+0108->U+0109,\
U+0109, U+010A->U+010B, U+010B, U+010C->U+010D, U+010D, U+010E->U+010F, U+010F,\
U+0110->U+0111, U+0111, U+0112->U+0113, U+0113, U+0114->U+0115, U+0115, \
U+0116->U+0117,U+0117, U+0118->U+0119, U+0119, U+011A->U+011B, U+011B, U+011C->U+011D,\
U+011D,U+011E->U+011F, U+011F, U+0130->U+0131, U+0131, U+0132->U+0133, U+0133, \
U+0134->U+0135,U+0135, U+0136->U+0137, U+0137, U+0139->U+013A, U+013A, U+013B->U+013C, \
U+013C,U+013D->U+013E, U+013E, U+013F->U+0140, U+0140, U+0141->U+0142, U+0142, \
U+0143->U+0144,U+0144, U+0145->U+0146, U+0146, U+0147->U+0148, U+0148, U+014A->U+014B, \
U+014B,U+014C->U+014D, U+014D, U+014E->U+014F, U+014F, U+0150->U+0151, U+0151, \
U+0152->U+0153,U+0153, U+0154->U+0155, U+0155, U+0156->U+0157, U+0157, U+0158->U+0159,\
U+0159,U+015A->U+015B, U+015B, U+015C->U+015D, U+015D, U+015E->U+015F, U+015F, \
U+0160->U+0161,U+0161, U+0162->U+0163, U+0163, U+0164->U+0165, U+0165, U+0166->U+0167, \
U+0167,U+0168->U+0169, U+0169, U+016A->U+016B, U+016B, U+016C->U+016D, U+016D, \
U+016E->U+016F,U+016F, U+0170->U+0171, U+0171, U+0172->U+0173, U+0173, U+0174->U+0175,\
U+0175,U+0176->U+0177, U+0177, U+0178->U+00FF, U+00FF, U+0179->U+017A, U+017A, \
U+017B->U+017C,U+017C, U+017D->U+017E, U+017E, U+0410..U+042F->U+0430..U+044F, \
U+0430..U+044F,U+05D0..U+05EA, U+0531..U+0556->U+0561..U+0586, U+0561..U+0587, \
U+0621..U+063A, U+01B9,U+01BF, U+0640..U+064A, U+0660..U+0669, U+066E, U+066F, \
U+0671..U+06D3, U+06F0..U+06FF,U+0904..U+0939, U+0958..U+095F, U+0960..U+0963, \
U+0966..U+096F, U+097B..U+097F,U+0985..U+09B9, U+09CE, U+09DC..U+09E3, U+09E6..U+09EF, \
U+0A05..U+0A39, U+0A59..U+0A5E,U+0A66..U+0A6F, U+0A85..U+0AB9, U+0AE0..U+0AE3, \
U+0AE6..U+0AEF, U+0B05..U+0B39,U+0B5C..U+0B61, U+0B66..U+0B6F, U+0B71, U+0B85..U+0BB9, \
U+0BE6..U+0BF2, U+0C05..U+0C39,U+0C66..U+0C6F, U+0C85..U+0CB9, U+0CDE..U+0CE3, \
U+0CE6..U+0CEF, U+0D05..U+0D39, U+0D60,U+0D61, U+0D66..U+0D6F, U+0D85..U+0DC6, \
U+1900..U+1938, U+1946..U+194F, U+A800..U+A805,U+A807..U+A822, U+0386->U+03B1, \
U+03AC->U+03B1, U+0388->U+03B5, U+03AD->U+03B5,U+0389->U+03B7, U+03AE->U+03B7, \
U+038A->U+03B9, U+0390->U+03B9, U+03AA->U+03B9,U+03AF->U+03B9, U+03CA->U+03B9, \
U+038C->U+03BF, U+03CC->U+03BF, U+038E->U+03C5,U+03AB->U+03C5, U+03B0->U+03C5, \
U+03CB->U+03C5, U+03CD->U+03C5, U+038F->U+03C9,U+03CE->U+03C9, U+03C2->U+03C3, \
U+0391..U+03A1->U+03B1..U+03C1,U+03A3..U+03A9->U+03C3..U+03C9, U+03B1..U+03C1, \
U+03C3..U+03C9, U+0E01..U+0E2E,U+0E30..U+0E3A, U+0E40..U+0E45, U+0E47, U+0E50..U+0E59, \
U+A000..U+A48F, U+4E00..U+9FBF,U+3400..U+4DBF, U+20000..U+2A6DF, U+F900..U+FAFF, \
U+2F800..U+2FA1F, U+2E80..U+2EFF,U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF, \
U+3040..U+309F, U+30A0..U+30FF,U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF, \
U+3130..U+318F, U+A000..U+A48F,U+A490..U+A4CF
min_prefix_len = 0
min_infix_len = 1
ngram_len = 1

}

#posts_minute
source posts_minute : posts
{
sql_query_pre            =
sql_query_pre            = SET NAMES UTF8
sql_query_pre           = SET SESSION query_cache_type=OFF

sql_query_range            = SELECT max_doc_id+1,(SELECT MAX(pid) FROM pre_forum_post) FROM sph_counter WHERE counter_id=2
}

#posts_minute
index posts_minute : posts
{
source            = posts_minute
path            = /usr/local/coreseek/var/data/posts_minute
}

#############################################################################
## indexer settings
#############################################################################

indexer
{
# memory limit, in bytes, kilobytes (16384K) or megabytes (256M)
# optional, default is 32M, max is 2047M, recommended is 256M to 1024M
mem_limit            = 256M

# maximum IO calls per second (for I/O throttling)
# optional, default is 0 (unlimited)
#
# max_iops            = 40

# maximum IO call size, bytes (for I/O throttling)
# optional, default is 0 (unlimited)
#
# max_iosize        = 1048576

# maximum xmlpipe2 field length, bytes
# optional, default is 2M
#
# max_xmlpipe2_field    = 4M

# write buffer size, bytes
# several (currently up to 4) buffers will be allocated
# write buffers are allocated in addition to mem_limit
# optional, default is 1M
#
# write_buffer        = 1M
}

#############################################################################
## searchd settings
#############################################################################

searchd
{
# hostname, port, or hostname:port, or /unix/socket/path to listen on
# multi-value, multiple listen points are allowed
# optional, default is 0.0.0.0:9312 (listen on all interfaces, port 9312)
#
# listen                = 127.0.0.1
# listen                = 192.168.0.1:9312
listen                = 9312
# listen                = /var/run/searchd.sock

# log file, searchd run info is logged here
# optional, default is 'searchd.log'
log                    = /usr/local/coreseek/var/log/searchd.log

# query log file, all search queries are logged here
# optional, default is empty (do not log queries)
query_log            = /usr/local/coreseek/var/log/query.log

# client read timeout, seconds
# optional, default is 5
read_timeout        = 5

# request timeout, seconds
# optional, default is 5 minutes
client_timeout        = 300

# maximum amount of children to fork (concurrent searches to run)
# optional, default is 0 (unlimited)
max_children        = 30

# PID file, searchd process ID file name
# mandatory
pid_file            = /usr/local/coreseek/var/log/searchd.pid

# max amount of matches the daemon ever keeps in RAM, per-index
# WARNING, THERE'S ALSO PER-QUERY LIMIT, SEE SetLimits() API CALL
# default is 1000 (just like Google)
max_matches            = 1000

# seamless rotate, prevents rotate stalls if precaching huge datasets
# optional, default is 1
seamless_rotate        = 1

# whether to forcibly preopen all indexes on startup
# optional, default is 0 (do not preopen)
preopen_indexes        = 0

# whether to unlink .old index copies on successful rotation.
# optional, default is 1 (do unlink)
unlink_old            = 1

# attribute updates periodic flush timeout, seconds
# updates will be automatically dumped to disk this frequently
# optional, default is 0 (disable periodic flush)
#
# attr_flush_period    = 900

# instance-wide ondisk_dict defaults (per-index value take precedence)
# optional, default is 0 (precache all dictionaries in RAM)
#
# ondisk_dict_default    = 1

# MVA updates pool size
# shared between all instances of searchd, disables attr flushes!
# optional, default size is 1M
mva_updates_pool    = 1M

# max allowed network packet size
# limits both query packets from clients, and responses from agents
# optional, default size is 8M
max_packet_size        = 8M

# crash log path
# searchd will (try to) log crashed query to 'crash_log_path.PID' file
# optional, default is empty (do not create crash logs)
#
# crash_log_path        = /usr/local/coreseek/var/log/crash

# max allowed per-query filter count
# optional, default is 256
max_filters            = 256

# max allowed per-filter values count
# optional, default is 4096
max_filter_values    = 4096

# socket listen queue length
# optional, default is 5
#
# listen_backlog        = 5

# per-keyword read buffer size
# optional, default is 256K
#
# read_buffer            = 256K

# unhinted read size (currently used when reading hits)
# optional, default is 32K
#
# read_unhinted        = 32K
}

# --eof--
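
The main and delta sources in the configuration above divide the document IDs through the sph_counter table. The main source stores MAX(tid)-100 (MAX(pid)-5000 for posts) as max_doc_id and indexes up to that value. The matching _minute source indexes from max_doc_id+1 up to the current maximum. A small sketch of that arithmetic, with a hypothetical helper `split_ranges` and made-up numbers:

```shell
#!/bin/sh
# split_ranges MIN_ID MAX_ID_AT_MAIN_BUILD MARGIN CURRENT_MAX_ID
# Prints the ID range covered by the main index and by the delta index.
split_ranges() {
    counter=$(( $2 - $3 ))          # value stored in sph_counter.max_doc_id
    echo "main: $1..$counter"
    echo "delta: $(( counter + 1 ))..$4"
}

# Example: the main index was built when MAX(tid) was 1000 (margin 100),
# and the forum has since grown to tid 1250:
#   split_ranges 1 1000 100 1250
```

Because the _minute sources reset sql_query_pre, they do not update the counter, so the delta range keeps growing until the main index is rebuilt.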

Next, build the indexes:

/usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/sphinx.conf --all
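
The configuration above sets charset_dictpath to two different directories (/usr/local/mmseg3/etc/ for threads and /usr/local/coreseek/etc/ for posts), while uni.lib was only copied to the first one. The following sketch lists the dictionary directories that have no uni.lib. `dict_paths` and `missing_dicts` are hypothetical helpers, and whether a missing dictionary matters for a given index depends on that index's charset_type.

```shell
#!/bin/sh
# Print every uncommented charset_dictpath value found in a config file.
dict_paths() {
    awk -F= '/^[ \t]*charset_dictpath[ \t]*=/ {
        gsub(/[ \t]/, "", $2)
        print $2
    }' "$1"
}

# Print each dictionary directory that does not hold a non-empty uni.lib.
missing_dicts() {
    dict_paths "$1" | sort -u | while read -r dir; do
        [ -s "${dir%/}/uni.lib" ] || echo "$dir"
    done
}

# Usage:
#   missing_dicts /usr/local/coreseek/etc/sphinx.conf
```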

Once that finishes, start the searchd daemon:

/usr/local/coreseek/bin/searchd -c /usr/local/coreseek/etc/sphinx.conf
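
One way to confirm that searchd actually stayed up is to check the PID file named by pid_file in the configuration. A minimal sketch, where `pid_alive` is a hypothetical helper:

```shell
#!/bin/sh
# Return 0 if the PID recorded in the given pid file belongs to a live process.
pid_alive() {
    pidfile=$1
    [ -s "$pidfile" ] || return 1      # missing or empty pid file
    pid=$(cat "$pidfile")
    kill -0 "$pid" 2>/dev/null         # signal 0 only tests for existence
}

# Usage:
#   pid_alive /usr/local/coreseek/var/log/searchd.pid && echo "searchd is running"
```

This only shows that the process exists. If searches still fail, /usr/local/coreseek/var/log/searchd.log is the next place to look.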

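At this point coreseek (Sphinx) is serving requests normally. Next, DZ X2 needs a few changes:
Log in to the X2 admin panel and go to Global => Search Settings => enable Sphinx as the full-text search engine, with the settings shown in the screenshot (not reproduced here).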

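Normally that is all you need to use Sphinx full-text search in Dz X2. But because Sphinx full-text search is so capable, I decided to have it replace the regular search entirely, which requires changing one of the Dz X2 search scripts:
Edit source\module\search\search_forum.php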

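Find the line $srchtype = empty($_G['gp_srchtype']) ? '' : trim($_G['gp_srchtype']); and comment it out by putting a # in front of it. Then start a new line and add: $srchtype = empty($_G['gp_srchtype']) ? '' :'fulltext'; With this change, Sphinx full-text search is used whether or not the user selects full-text search. (A note: I originally used $srchtype = 'fulltext'; but later found that this breaks the "view new posts" feature. I fixed that today.)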
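
The same edit can be scripted. The sketch below assumes the line appears in search_forum.php exactly as quoted above; `patch_srchtype` is a hypothetical helper. It keeps a .bak copy of the file and does nothing if the replacement line is already present.

```shell
#!/bin/sh
# Comment out the original $srchtype line with '#' and add the line that
# forces 'fulltext' right after it.
patch_srchtype() {
    file=$1
    old="\$srchtype = empty(\$_G['gp_srchtype']) ? '' : trim(\$_G['gp_srchtype']);"
    new="\$srchtype = empty(\$_G['gp_srchtype']) ? '' : 'fulltext';"

    # Already patched: nothing to do.
    if grep -qF -- "$new" "$file"; then
        return 0
    fi

    # grep -F treats the PHP line as a fixed string, not as a regex.
    n=$(grep -nF -- "$old" "$file" | head -n 1 | cut -d: -f1)
    if [ -z "$n" ]; then
        echo "target line not found in $file" >&2
        return 1
    fi

    cp "$file" "$file.bak" || return 1
    # The new line is written without the original indentation.
    awk -v n="$n" -v new="$new" \
        'NR == n { print "#" $0; print new; next } { print }' \
        "$file.bak" > "$file"
}

# Usage (run from the Discuz! root directory):
#   patch_srchtype source/module/search/search_forum.php
```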

Next, set up Sphinx delta indexing:

Delta index script: build_delta_index.sh

#!/bin/sh
/usr/local/coreseek/bin/indexer --config /usr/local/coreseek/etc/sphinx.conf threads_minute posts_minute --rotate >> /var/log/sphinx_delta.log
#/usr/local/coreseek/bin/indexer --config /usr/local/coreseek/etc/sphinx.conf --merge threads threads_minute --rotate >> /var/log/sphinx_delta.log
#/usr/local/coreseek/bin/indexer --config /usr/local/coreseek/etc/sphinx.conf --merge posts posts_minute --rotate >> /var/log/sphinx_delta.log

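Note the # at the start of the last two lines. Originally I planned to merge the indexes in the same run that builds the delta index, but I decided against it for two reasons. First, with roughly 230 million posts already indexed, every merge takes a long time. Second, when Sphinx merges, it does not remove duplicate records but only adds new ones, so the index files would double in size. The index files already take up tens of GB, so doubling them is not worth it. Fortunately, DZ X2 searches both the main index and the delta index, so I only merge the delta index once a day. So far this has been running well.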

Merge script for the delta index: merge_delta_index.sh

#!/bin/sh
/usr/local/coreseek/bin/indexer --config /usr/local/coreseek/etc/sphinx.conf --merge threads threads_minute --rotate >> /var/log/sphinx_delta.log
/usr/local/coreseek/bin/indexer --config /usr/local/coreseek/etc/sphinx.conf --merge posts posts_minute --rotate >> /var/log/sphinx_delta.log

Main index script: build_main_index.sh

#!/bin/sh
/usr/local/coreseek/bin/indexer --config /usr/local/coreseek/etc/sphinx.conf threads posts --rotate >> /var/log/sphinx_main.log

Update the indexes on a schedule with cron:

crontab -e

*/5 * * * * /root/build_delta_index.sh > /dev/null 2>&1
0 3 * * 1-6 /root/merge_delta_index.sh > /dev/null 2>&1
0 3 * * 0 /root/build_main_index.sh > /dev/null 2>&1

Save and exit.
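
Since the delta job runs every 5 minutes while a merge or a full rebuild can take much longer, two indexer runs may overlap. One possible guard is a simple lock around each job. This is a sketch only: `with_lock` and the lock path are assumptions, not part of the original setup.

```shell
#!/bin/sh
# with_lock LOCKDIR COMMAND [ARGS...]
# Run COMMAND only if LOCKDIR can be created. mkdir is atomic, so two
# overlapping cron runs cannot both obtain the lock.
with_lock() {
    lockdir=$1
    shift
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "another index job holds $lockdir, skipping" >&2
        return 75
    fi
    status=0
    "$@" || status=$?
    rmdir "$lockdir"
    return "$status"
}

# Usage inside build_delta_index.sh (lock path is hypothetical):
#   with_lock /tmp/sphinx_index.lock \
#       /usr/local/coreseek/bin/indexer --config /usr/local/coreseek/etc/sphinx.conf \
#       threads_minute posts_minute --rotate >> /var/log/sphinx_delta.log
```

If a job is killed before it reaches rmdir, the lock directory is left behind and has to be removed by hand.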

Some problems I ran into while installing Sphinx, and how I solved them:

/usr/local/coreseek/bin/search --config /usr/local/sphinx/etc/sphinx.conf
/usr/local/sphinx/bin/indexer: error while loading shared libraries: libmysqlclient.so.16: cannot open shared object file: No such file or directory

This happens because coreseek (Sphinx) cannot find libmysqlclient.so. Fix:

vi /etc/ld.so.conf
Add this line at the end:
/usr/lib/mysql
Then refresh the linker cache:
ldconfig
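
To see which shared libraries a binary still cannot resolve, before or after running ldconfig, the output of ldd can be filtered. A minimal sketch, assuming the usual `name => not found` format of ldd output; `not_found_libs` is a hypothetical helper:

```shell
#!/bin/sh
# Read ldd output on stdin and print the names of libraries marked "not found".
not_found_libs() {
    awk '/=> not found/ { print $1 }'
}

# Usage:
#   ldd /usr/local/coreseek/bin/indexer | not_found_libs
```

An empty result means every library was resolved.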

./bootstrap: line 23: aclocal: command not found
./bootstrap: line 24: libtoolize: command not found
autoconf, automake and libtool need to be installed. Fix:
yum install autoconf automake libtool

WARNING: source 'xml': xmlpipe2 support NOT compiled in. To use xmlpipe2, install missing XML libraries, reconfigure, and rebuild Sphinx
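libxml2 needs to be installed (building against it also needs the libxml2-devel headers). Fix:

yum install libxml2 libxml2-devel
Then recompile coreseek (Sphinx).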

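If you get an error saying that an option such as charset_dictpath is not supported, you are most likely running the stock Sphinx binaries rather than coreseek; only coreseek supports that option. Running the binaries from the correct coreseek path fixes it.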

沧海一粟
