标题:Discuz! X2增加Sphinx全文检索支持操作记录 出处:沧海一粟 时间:Sat, 28 Apr 2012 16:35:26 +0000 作者:jed 地址:http://www.dzhope.com/post/913/ 内容: Sphinx是一个很好的全文可检索软件,它支持MySQL和PGSQL. 一般来说Sphinx原版对英文全文检索校好,但对中文全文检索的话,就使用国人的修改版coreseek了. coreseek安装比较麻烦一些.是我的安装记录: 首先下载coreseek源码: wget http://www.coreseek.cn/uploads/csft/3.2/coreseek-3.2.14.tar.gz 解压: tar xzvf coreseek-3.2.14.tar.gz cd coreseek-3.2.14 进入coreseek目录下有三个子目录,分别是mmseg和csft和testpack.需要分别先后安装mmseg和csft. 安装mmseg,中文分词库: cd mmseg-3.2.14 aclocal libtoolize --force automake --add-missing autoconf autoheader make clean #此时如有错误可忽略不管 ./configure --prefix=/usr/local/mmseg3 make && make install 在这里先做一点点小优化: cd data ln -s /usr/local/mmseg3/mmseg /usr/bin/mmseg mmseg -u unigram.txt cp unigram.txt.uni /usr/local/mmseg3/etc/uni.lib cd .. 回到上级目录: cd .. 安装csft,也就是coreseek主程序: cd csft-3.2.14 sh buildconf.sh ./configure --prefix=/usr/local/coreseek --without-python \ --without-unixodbc --with-mmseg --with-mmseg-includes=/usr/local/mmseg3/include/mmseg/ \ --with-mmseg-libs=/usr/local/mmseg3/lib/ --with-mysql make && make install cd .. 这样coreseek就安装好了. 简单测试一下coreseek是否运行正确: cd ../testpack /usr/local/coreseek/bin/indexer -c etc/csft.conf ##以下为正常情况下的提示信息: Coreseek Fulltext 3.2 [ Sphinx 0.9.9-release (r2117)] Copyright (c) 2007-2010, Beijing Choice Software Technologies Inc (http://www.coreseek.com) using config file 'etc/csft.conf'... total 0 reads, 0.000 sec, 0.0 kb/call avg, 0.0 msec/call avg total 0 writes, 0.000 sec, 0.0 kb/call avg, 0.0 msec/call avg /usr/local/coreseek/bin/indexer -c etc/csft.conf --all 下面修改配置,支持Discuz! X2: vi /usr/local/coreseek/etc/sphinx.conf # # Sphinx configuration file sample # # WARNING! While this sample file mentions all available options, # it contains (very) short helper descriptions only. Please refer to # doc/sphinx.html for details. # ############################################################################# ## data source definition ############################################################################# source threads { # data source type. mandatory, no default value # known types are mysql, pgsql, mssql, xmlpipe, xmlpipe2, odbc type = mysql ##################################################################### ## SQL settings (for 'mysql' and 'pgsql' types) ##################################################################### # some straightforward parameters for SQL source types sql_host = localhost sql_user = XXX sql_pass = XXX sql_db = XXX sql_port = 3306 # optional, default is 3306 # UNIX socket name # optional, default is empty (reuse client library defaults) # usually '/var/lib/mysql/mysql.sock' on Linux # usually '/tmp/mysql.sock' on FreeBSD # # sql_sock = /tmp/mysql.sock # MySQL specific client connection flags # optional, default is 0 # # mysql_connect_flags = 32 # enable compression # MySQL specific SSL certificate settings # optional, defaults are empty # # mysql_ssl_cert = /etc/ssl/client-cert.pem # mysql_ssl_key = /etc/ssl/client-key.pem # mysql_ssl_ca = /etc/ssl/cacert.pem # MS SQL specific Windows authentication mode flag # MUST be in sync with charset_type index-level setting # optional, default is 0 # # mssql_winauth = 1 # use currently logged on user credentials # MS SQL specific Unicode indexing flag # optional, default is 0 (request SBCS data) # # mssql_unicode = 1 # request Unicode data from server # ODBC specific DSN (data source name) # mandatory for odbc source type, no default value # # odbc_dsn = DBQ=C:\data;DefaultDir=C:\data;Driver={Microsoft Text Driver (*.txt; *.csv)}; # sql_query = SELECT id, data FROM documents.csv # pre-query, executed before the main fetch query # multi-value, optional, default is empty list of queries # sql_query_pre = SET NAMES utf8 sql_query_pre = SET SESSION query_cache_type=OFF #timy sql_query_pre = CREATE TABLE IF NOT EXISTS sph_counter ( counter_id INTEGER PRIMARY KEY NOT NULL,max_doc_id INTEGER NOT NULL) sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(tid)-100 FROM pre_forum_thread #timy # main document fetch query # mandatory, integer document ID field MUST be the first selected column sql_query = \ SELECT t.tid AS id,t.tid,t.subject,t.digest,t.displayorder,t.authorid,t.lastpost,t.special \ FROM pre_forum_thread AS t \ WHERE t.tid>=$start AND t.tid<=$end # range query setup, query that must return min and max ID values # optional, default is empty # # sql_query will need to reference $start and $end boundaries # if using ranged query: # # sql_query = \ # SELECT doc.id, doc.id AS group, doc.title, doc.data \ # FROM documents doc \ # WHERE id>=$start AND id<=$end # sql_query_range = SELECT (SELECT MIN(tid) FROM pre_forum_thread),max_doc_id FROM sph_counter WHERE counter_id=1 # range query step # optional, default is 1024 # # sql_range_step = 1000 # unsigned integer attribute declaration # multi-value (an arbitrary number of attributes is allowed), optional # optional bit size can be specified, default is 32 # # sql_attr_uint = author_id # sql_attr_uint = forum_id:9 # 9 bits for forum_id sql_attr_uint = tid sql_attr_uint = digest sql_attr_uint = displayorder sql_attr_uint = authorid sql_attr_uint = special # boolean attribute declaration # multi-value (an arbitrary number of attributes is allowed), optional # equivalent to sql_attr_uint with 1-bit size # # sql_attr_bool = is_deleted # bigint attribute declaration # multi-value (an arbitrary number of attributes is allowed), optional # declares a signed (unlike uint!) 64-bit attribute # # sql_attr_bigint = my_bigint_id # UNIX timestamp attribute declaration # multi-value (an arbitrary number of attributes is allowed), optional # similar to integer, but can also be used in date functions # # sql_attr_timestamp = posted_ts # sql_attr_timestamp = last_edited_ts sql_attr_timestamp = lastpost # string ordinal attribute declaration # multi-value (an arbitrary number of attributes is allowed), optional # sorts strings (bytewise), and stores their indexes in the sorted list # sorting by this attr is equivalent to sorting by the original strings # # sql_attr_str2ordinal = author_name # floating point attribute declaration # multi-value (an arbitrary number of attributes is allowed), optional # values are stored in single precision, 32-bit IEEE 754 format # # sql_attr_float = lat_radians # sql_attr_float = long_radians # multi-valued attribute (MVA) attribute declaration # multi-value (an arbitrary number of attributes is allowed), optional # MVA values are variable length lists of unsigned 32-bit integers # # syntax is ATTR-TYPE ATTR-NAME 'from' SOURCE-TYPE [;QUERY] [;RANGE-QUERY] # ATTR-TYPE is 'uint' or 'timestamp' # SOURCE-TYPE is 'field', 'query', or 'ranged-query' # QUERY is SQL query used to fetch all ( docid, attrvalue ) pairs # RANGE-QUERY is SQL query used to fetch min and max ID values, similar to 'sql_query_range' # # sql_attr_multi = uint tag from query; SELECT id, tag FROM tags # sql_attr_multi = uint tag from ranged-query; \ # SELECT id, tag FROM tags WHERE id>=$start AND id<=$end; \ # SELECT MIN(id), MAX(id) FROM tags # post-query, executed on sql_query completion # optional, default is empty # # sql_query_post = # post-index-query, executed on successful indexing completion # optional, default is empty # $maxid expands to max document ID actually fetched from DB # # sql_query_post_index = REPLACE INTO counters ( id, val ) \ # VALUES ( 'max_indexed_id', $maxid ) # ranged query throttling, in milliseconds # optional, default is 0 which means no delay # enforces given delay before each query step sql_ranged_throttle = 0 # document info query, ONLY for CLI search (ie. testing and debugging) # optional, default is empty # must contain $id macro and must fetch the document by that id sql_query_info = SELECT * FROM pre_forum_thread WHERE tid=$id # kill-list query, fetches the document IDs for kill-list # k-list will suppress matches from preceding indexes in the same query # optional, default is empty # # sql_query_killlist = SELECT id FROM documents WHERE edited>=@last_reindex # columns to unpack on indexer side when indexing # multi-value, optional, default is empty list # # unpack_zlib = zlib_column # unpack_mysqlcompress = compressed_column # unpack_mysqlcompress = compressed_column_2 # maximum unpacked length allowed in MySQL COMPRESS() unpacker # optional, default is 16M # # unpack_mysqlcompress_maxsize = 16M ##################################################################### ## xmlpipe settings ##################################################################### # type = xmlpipe # shell command to invoke xmlpipe stream producer # mandatory # # xmlpipe_command = cat /usr/local/coreseek/var/test.xml ##################################################################### ## xmlpipe2 settings ##################################################################### # type = xmlpipe2 # xmlpipe_command = cat /usr/local/coreseek/var/test2.xml # xmlpipe2 field declaration # multi-value, optional, default is empty # # xmlpipe_field = subject # xmlpipe_field = content # xmlpipe2 attribute declaration # multi-value, optional, default is empty # all xmlpipe_attr_XXX options are fully similar to sql_attr_XXX # # xmlpipe_attr_timestamp = published # xmlpipe_attr_uint = author_id # perform UTF-8 validation, and filter out incorrect codes # avoids XML parser choking on non-UTF-8 documents # optional, default is 0 # # xmlpipe_fixup_utf8 = 1 } ############################################################################# ## index definition ############################################################################# # local index example # # this is an index which is stored locally in the filesystem # # all indexing-time options (such as morphology and charsets) # are configured per local index index threads { # document source(s) to index # multi-value, mandatory # document IDs must be globally unique across all sources source = threads # index files path and file name, without extension # mandatory, path must be writable, extensions will be auto-appended path = /usr/local/coreseek/var/data/threads # document attribute values (docinfo) storage mode # optional, default is 'extern' # known values are 'none', 'extern' and 'inline' docinfo = extern #charset_dictpath = /etc # memory locking for cached data (.spa and .spi), to prevent swapping # optional, default is 0 (do not mlock) # requires searchd to be run from root mlock = 0 # a list of morphology preprocessors to apply # optional, default is empty # # builtin preprocessors are 'none', 'stem_en', 'stem_ru', 'stem_enru', # 'soundex', and 'metaphone'; additional preprocessors available from # libstemmer are 'libstemmer_XXX', where XXX is algorithm code # (see libstemmer_c/libstemmer/modules.txt) # # morphology = stem_en, stem_ru, soundex # morphology = libstemmer_german # morphology = libstemmer_sv morphology = none # minimum word length at which to enable stemming # optional, default is 1 (stem everything) # # min_stemming_len = 1 # stopword files list (space separated) # optional, default is empty # contents are plain text, charset_table and stemming are both applied # # stopwords = /usr/local/coreseek/var/data/stopwords.txt # wordforms file, in "mapfrom > mapto" plain text format # optional, default is empty # # wordforms = /usr/local/sphinx/var/data/wordforms.txt # tokenizing exceptions file # optional, default is empty # # plain text, case sensitive, space insensitive in map-from part # one "Map Several Words => ToASingleOne" entry per line # # exceptions = /usr/local/sphinx/var/data/exceptions.txt # minimum indexed word length # default is 1 (index everything) min_word_len = 1 # charset encoding type # optional, default is 'sbcs' # known types are 'sbcs' (Single Byte CharSet) and 'utf-8' charset_type = utf-8 charset_dictpath = /usr/local/mmseg3/etc/ ##### 字符表,注意:如使用这种方式,则sphinx会对中文进行单字切分, ##### 即进行字索引,若要使用中文分词,必须使用其他分词插件如 coreseek,sfc charset_table = U+FF10..U+FF19->0..9, 0..9, U+FF41..U+FF5A->a..z, U+FF21..U+FF3A->a..z,\ A..Z->a..z, a..z, U+0149, U+017F, U+0138, U+00DF, U+00FF, U+00C0..U+00D6->U+00E0..U+00F6,\ U+00E0..U+00F6, U+00D8..U+00DE->U+00F8..U+00FE, U+00F8..U+00FE, U+0100->U+0101, U+0101,\ U+0102->U+0103, U+0103, U+0104->U+0105, U+0105, U+0106->U+0107, U+0107, U+0108->U+0109,\ U+0109, U+010A->U+010B, U+010B, U+010C->U+010D, U+010D, U+010E->U+010F, U+010F,\ U+0110->U+0111, U+0111, U+0112->U+0113, U+0113, U+0114->U+0115, U+0115, \ U+0116->U+0117,U+0117, U+0118->U+0119, U+0119, U+011A->U+011B, U+011B, U+011C->U+011D,\ U+011D,U+011E->U+011F, U+011F, U+0130->U+0131, U+0131, U+0132->U+0133, U+0133, \ U+0134->U+0135,U+0135, U+0136->U+0137, U+0137, U+0139->U+013A, U+013A, U+013B->U+013C, \ U+013C,U+013D->U+013E, U+013E, U+013F->U+0140, U+0140, U+0141->U+0142, U+0142, \ U+0143->U+0144,U+0144, U+0145->U+0146, U+0146, U+0147->U+0148, U+0148, U+014A->U+014B, \ U+014B,U+014C->U+014D, U+014D, U+014E->U+014F, U+014F, U+0150->U+0151, U+0151, \ U+0152->U+0153,U+0153, U+0154->U+0155, U+0155, U+0156->U+0157, U+0157, U+0158->U+0159,\ U+0159,U+015A->U+015B, U+015B, U+015C->U+015D, U+015D, U+015E->U+015F, U+015F, \ U+0160->U+0161,U+0161, U+0162->U+0163, U+0163, U+0164->U+0165, U+0165, U+0166->U+0167, \ U+0167,U+0168->U+0169, U+0169, U+016A->U+016B, U+016B, U+016C->U+016D, U+016D, \ U+016E->U+016F,U+016F, U+0170->U+0171, U+0171, U+0172->U+0173, U+0173, U+0174->U+0175,\ U+0175,U+0176->U+0177, U+0177, U+0178->U+00FF, U+00FF, U+0179->U+017A, U+017A, \ U+017B->U+017C,U+017C, U+017D->U+017E, U+017E, U+0410..U+042F->U+0430..U+044F, \ U+0430..U+044F,U+05D0..U+05EA, U+0531..U+0556->U+0561..U+0586, U+0561..U+0587, \ U+0621..U+063A, U+01B9,U+01BF, U+0640..U+064A, U+0660..U+0669, U+066E, U+066F, \ U+0671..U+06D3, U+06F0..U+06FF,U+0904..U+0939, U+0958..U+095F, U+0960..U+0963, \ U+0966..U+096F, U+097B..U+097F,U+0985..U+09B9, U+09CE, U+09DC..U+09E3, U+09E6..U+09EF, \ U+0A05..U+0A39, U+0A59..U+0A5E,U+0A66..U+0A6F, U+0A85..U+0AB9, U+0AE0..U+0AE3, \ U+0AE6..U+0AEF, U+0B05..U+0B39,U+0B5C..U+0B61, U+0B66..U+0B6F, U+0B71, U+0B85..U+0BB9, \ U+0BE6..U+0BF2, U+0C05..U+0C39,U+0C66..U+0C6F, U+0C85..U+0CB9, U+0CDE..U+0CE3, \ U+0CE6..U+0CEF, U+0D05..U+0D39, U+0D60,U+0D61, U+0D66..U+0D6F, U+0D85..U+0DC6, \ U+1900..U+1938, U+1946..U+194F, U+A800..U+A805,U+A807..U+A822, U+0386->U+03B1, \ U+03AC->U+03B1, U+0388->U+03B5, U+03AD->U+03B5,U+0389->U+03B7, U+03AE->U+03B7, \ U+038A->U+03B9, U+0390->U+03B9, U+03AA->U+03B9,U+03AF->U+03B9, U+03CA->U+03B9, \ U+038C->U+03BF, U+03CC->U+03BF, U+038E->U+03C5,U+03AB->U+03C5, U+03B0->U+03C5, \ U+03CB->U+03C5, U+03CD->U+03C5, U+038F->U+03C9,U+03CE->U+03C9, U+03C2->U+03C3, \ U+0391..U+03A1->U+03B1..U+03C1,U+03A3..U+03A9->U+03C3..U+03C9, U+03B1..U+03C1, \ U+03C3..U+03C9, U+0E01..U+0E2E,U+0E30..U+0E3A, U+0E40..U+0E45, U+0E47, U+0E50..U+0E59, \ U+A000..U+A48F, U+4E00..U+9FBF,U+3400..U+4DBF, U+20000..U+2A6DF, U+F900..U+FAFF, \ U+2F800..U+2FA1F, U+2E80..U+2EFF,U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF, \ U+3040..U+309F, U+30A0..U+30FF,U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF, \ U+3130..U+318F, U+A000..U+A48F,U+A490..U+A4CF min_prefix_len = 0 min_infix_len = 1 ngram_len = 1 # charset definition and case folding rules "table" # optional, default value depends on charset_type # # defaults are configured to include English and Russian characters only # you need to change the table to include additional ones # this behavior MAY change in future versions # # 'sbcs' default value is # charset_table = 0..9, A..Z->a..z, _, a..z, U+A8->U+B8, U+B8, U+C0..U+DF->U+E0..U+FF, U+E0..U+FF # # 'utf-8' default value is # charset_table = 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F # ignored characters list # optional, default value is empty # # ignore_chars = U+00AD # minimum word prefix length to index # optional, default is 0 (do not index prefixes) # # min_prefix_len = 0 # minimum word infix length to index # optional, default is 0 (do not index infixes) # # min_infix_len = 0 # list of fields to limit prefix/infix indexing to # optional, default value is empty (index all fields in prefix/infix mode) # # prefix_fields = filename # infix_fields = url, domain # enable star-syntax (wildcards) when searching prefix/infix indexes # known values are 0 and 1 # optional, default is 0 (do not use wildcard syntax) # # enable_star = 1 # n-gram length to index, for CJK indexing # only supports 0 and 1 for now, other lengths to be implemented # optional, default is 0 (disable n-grams) # # ngram_len = 1 # n-gram characters list, for CJK indexing # optional, default is empty # # ngram_chars = U+3000..U+2FA1F # phrase boundary characters list # optional, default is empty # # phrase_boundary = ., ?, !, U+2026 # horizontal ellipsis # phrase boundary word position increment # optional, default is 0 # # phrase_boundary_step = 100 # whether to strip HTML tags from incoming documents # known values are 0 (do not strip) and 1 (do strip) # optional, default is 0 html_strip = 0 # what HTML attributes to index if stripping HTML # optional, default is empty (do not index anything) # # html_index_attrs = img=alt,title; a=title; # what HTML elements contents to strip # optional, default is empty (do not strip element contents) # # html_remove_elements = style, script # whether to preopen index data files on startup # optional, default is 0 (do not preopen), searchd-only # # preopen = 1 # whether to keep dictionary (.spi) on disk, or cache it in RAM # optional, default is 0 (cache in RAM), searchd-only # # ondisk_dict = 1 # whether to enable in-place inversion (2x less disk, 90-95% speed) # optional, default is 0 (use separate temporary files), indexer-only # # inplace_enable = 1 # in-place fine-tuning options # optional, defaults are listed below # # inplace_hit_gap = 0 # preallocated hitlist gap size # inplace_docinfo_gap = 0 # preallocated docinfo gap size # inplace_reloc_factor = 0.1 # relocation buffer size within arena # inplace_write_factor = 0.1 # write buffer size within arena # whether to index original keywords along with stemmed versions # enables "=exactform" operator to work # optional, default is 0 # # index_exact_words = 1 # position increment on overshort (less that min_word_len) words # optional, allowed values are 0 and 1, default is 1 # # overshort_step = 1 # position increment on stopword # optional, allowed values are 0 and 1, default is 1 # # stopword_step = 1 } #threads_minute source threads_minute : threads { sql_query_pre = sql_query_pre = SET NAMES UTF8 sql_query_pre = SET SESSION query_cache_type=OFF sql_query_range = SELECT max_doc_id+1,(SELECT MAX(tid) FROM pre_forum_thread) FROM sph_counter WHERE counter_id=1 } #threads_minute index threads_minute : threads { source = threads_minute path = /usr/local/coreseek/var/data/threads_minute } #posts source posts : threads { type = mysql sql_query_pre = sql_query_pre = SET NAMES UTF8 sql_query_pre = SET SESSION query_cache_type=OFF sql_query_pre = CREATE TABLE IF NOT EXISTS sph_counter ( counter_id INTEGER PRIMARY KEY NOT NULL,max_doc_id INTEGER NOT NULL) sql_query_pre = REPLACE INTO sph_counter SELECT 2, MAX(pid)-5000 FROM pre_forum_post sql_query = SELECT p.pid AS id,p.tid,p.subject,p.message,t.digest,t.displayorder,t.authorid,t.lastpost,t.special \ FROM pre_forum_post AS p LEFT JOIN pre_forum_thread AS t USING(tid) \ WHERE p.pid>=$start AND p.pid<=$end sql_query_range = SELECT (SELECT MIN(pid) FROM pre_forum_post),max_doc_id FROM sph_counter WHERE counter_id=2 sql_range_step = 4096 sql_attr_uint = tid sql_attr_uint = digest sql_attr_uint = displayorder sql_attr_uint = authorid sql_attr_uint = special sql_attr_timestamp =lastpost sql_query_info = SELECT * FROM pre_forum_post WHERE pid=$id } #posts index posts { source = posts path = /usr/local/coreseek/var/data/posts docinfo = extern mlock = 0 morphology = none charset_dictpath= /usr/local/coreseek/etc/ charset_debug = 0 #### 索引的词最小长度 min_word_len = 1 charset_type = utf-8 html_strip = 0 ##### 字符表,注意:如使用这种方式,则sphinx会对中文进行单字切分, ##### 即进行字索引,若要使用中文分词,必须使用其他分词插件如 coreseek,sfc charset_table = U+FF10..U+FF19->0..9, 0..9, U+FF41..U+FF5A->a..z, U+FF21..U+FF3A->a..z,\ A..Z->a..z, a..z, U+0149, U+017F, U+0138, U+00DF, U+00FF, U+00C0..U+00D6->U+00E0..U+00F6,\ U+00E0..U+00F6, U+00D8..U+00DE->U+00F8..U+00FE, U+00F8..U+00FE, U+0100->U+0101, U+0101,\ U+0102->U+0103, U+0103, U+0104->U+0105, U+0105, U+0106->U+0107, U+0107, U+0108->U+0109,\ U+0109, U+010A->U+010B, U+010B, U+010C->U+010D, U+010D, U+010E->U+010F, U+010F,\ U+0110->U+0111, U+0111, U+0112->U+0113, U+0113, U+0114->U+0115, U+0115, \ U+0116->U+0117,U+0117, U+0118->U+0119, U+0119, U+011A->U+011B, U+011B, U+011C->U+011D,\ U+011D,U+011E->U+011F, U+011F, U+0130->U+0131, U+0131, U+0132->U+0133, U+0133, \ U+0134->U+0135,U+0135, U+0136->U+0137, U+0137, U+0139->U+013A, U+013A, U+013B->U+013C, \ U+013C,U+013D->U+013E, U+013E, U+013F->U+0140, U+0140, U+0141->U+0142, U+0142, \ U+0143->U+0144,U+0144, U+0145->U+0146, U+0146, U+0147->U+0148, U+0148, U+014A->U+014B, \ U+014B,U+014C->U+014D, U+014D, U+014E->U+014F, U+014F, U+0150->U+0151, U+0151, \ U+0152->U+0153,U+0153, U+0154->U+0155, U+0155, U+0156->U+0157, U+0157, U+0158->U+0159,\ U+0159,U+015A->U+015B, U+015B, U+015C->U+015D, U+015D, U+015E->U+015F, U+015F, \ U+0160->U+0161,U+0161, U+0162->U+0163, U+0163, U+0164->U+0165, U+0165, U+0166->U+0167, \ U+0167,U+0168->U+0169, U+0169, U+016A->U+016B, U+016B, U+016C->U+016D, U+016D, \ U+016E->U+016F,U+016F, U+0170->U+0171, U+0171, U+0172->U+0173, U+0173, U+0174->U+0175,\ U+0175,U+0176->U+0177, U+0177, U+0178->U+00FF, U+00FF, U+0179->U+017A, U+017A, \ U+017B->U+017C,U+017C, U+017D->U+017E, U+017E, U+0410..U+042F->U+0430..U+044F, \ U+0430..U+044F,U+05D0..U+05EA, U+0531..U+0556->U+0561..U+0586, U+0561..U+0587, \ U+0621..U+063A, U+01B9,U+01BF, U+0640..U+064A, U+0660..U+0669, U+066E, U+066F, \ U+0671..U+06D3, U+06F0..U+06FF,U+0904..U+0939, U+0958..U+095F, U+0960..U+0963, \ U+0966..U+096F, U+097B..U+097F,U+0985..U+09B9, U+09CE, U+09DC..U+09E3, U+09E6..U+09EF, \ U+0A05..U+0A39, U+0A59..U+0A5E,U+0A66..U+0A6F, U+0A85..U+0AB9, U+0AE0..U+0AE3, \ U+0AE6..U+0AEF, U+0B05..U+0B39,U+0B5C..U+0B61, U+0B66..U+0B6F, U+0B71, U+0B85..U+0BB9, \ U+0BE6..U+0BF2, U+0C05..U+0C39,U+0C66..U+0C6F, U+0C85..U+0CB9, U+0CDE..U+0CE3, \ U+0CE6..U+0CEF, U+0D05..U+0D39, U+0D60,U+0D61, U+0D66..U+0D6F, U+0D85..U+0DC6, \ U+1900..U+1938, U+1946..U+194F, U+A800..U+A805,U+A807..U+A822, U+0386->U+03B1, \ U+03AC->U+03B1, U+0388->U+03B5, U+03AD->U+03B5,U+0389->U+03B7, U+03AE->U+03B7, \ U+038A->U+03B9, U+0390->U+03B9, U+03AA->U+03B9,U+03AF->U+03B9, U+03CA->U+03B9, \ U+038C->U+03BF, U+03CC->U+03BF, U+038E->U+03C5,U+03AB->U+03C5, U+03B0->U+03C5, \ U+03CB->U+03C5, U+03CD->U+03C5, U+038F->U+03C9,U+03CE->U+03C9, U+03C2->U+03C3, \ U+0391..U+03A1->U+03B1..U+03C1,U+03A3..U+03A9->U+03C3..U+03C9, U+03B1..U+03C1, \ U+03C3..U+03C9, U+0E01..U+0E2E,U+0E30..U+0E3A, U+0E40..U+0E45, U+0E47, U+0E50..U+0E59, \ U+A000..U+A48F, U+4E00..U+9FBF,U+3400..U+4DBF, U+20000..U+2A6DF, U+F900..U+FAFF, \ U+2F800..U+2FA1F, U+2E80..U+2EFF,U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF, \ U+3040..U+309F, U+30A0..U+30FF,U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF, \ U+3130..U+318F, U+A000..U+A48F,U+A490..U+A4CF min_prefix_len = 0 min_infix_len = 1 ngram_len = 1 } #posts_minute source posts_minute : posts { sql_query_pre = sql_query_pre = SET NAMES UTF8 sql_query_pre = SET SESSION query_cache_type=OFF sql_query_range = SELECT max_doc_id+1,(SELECT MAX(pid) FROM pre_forum_post) FROM sph_counter WHERE counter_id=2 } #posts_minute index posts_minute : posts { source = posts_minute path = /usr/local/coreseek/var/data/posts_minute } ############################################################################# ## indexer settings ############################################################################# indexer { # memory limit, in bytes, kiloytes (16384K) or megabytes (256M) # optional, default is 32M, max is 2047M, recommended is 256M to 1024M mem_limit = 256M # maximum IO calls per second (for I/O throttling) # optional, default is 0 (unlimited) # # max_iops = 40 # maximum IO call size, bytes (for I/O throttling) # optional, default is 0 (unlimited) # # max_iosize = 1048576 # maximum xmlpipe2 field length, bytes # optional, default is 2M # # max_xmlpipe2_field = 4M # write buffer size, bytes # several (currently up to 4) buffers will be allocated # write buffers are allocated in addition to mem_limit # optional, default is 1M # # write_buffer = 1M } ############################################################################# ## searchd settings ############################################################################# searchd { # hostname, port, or hostname:port, or /unix/socket/path to listen on # multi-value, multiple listen points are allowed # optional, default is 0.0.0.0:9312 (listen on all interfaces, port 9312) # # listen = 127.0.0.1 # listen = 192.168.0.1:9312 listen = 9312 # listen = /var/run/searchd.sock # log file, searchd run info is logged here # optional, default is 'searchd.log' log = /usr/local/coreseek/var/log/searchd.log # query log file, all search queries are logged here # optional, default is empty (do not log queries) query_log = /usr/local/coreseek/var/log/query.log # client read timeout, seconds # optional, default is 5 read_timeout = 5 # request timeout, seconds # optional, default is 5 minutes client_timeout = 300 # maximum amount of children to fork (concurrent searches to run) # optional, default is 0 (unlimited) max_children = 30 # PID file, searchd process ID file name # mandatory pid_file = /usr/local/coreseek/var/log/searchd.pid # max amount of matches the daemon ever keeps in RAM, per-index # WARNING, THERE'S ALSO PER-QUERY LIMIT, SEE SetLimits() API CALL # default is 1000 (just like Google) max_matches = 1000 # seamless rotate, prevents rotate stalls if precaching huge datasets # optional, default is 1 seamless_rotate = 1 # whether to forcibly preopen all indexes on startup # optional, default is 0 (do not preopen) preopen_indexes = 0 # whether to unlink .old index copies on succesful rotation. # optional, default is 1 (do unlink) unlink_old = 1 # attribute updates periodic flush timeout, seconds # updates will be automatically dumped to disk this frequently # optional, default is 0 (disable periodic flush) # # attr_flush_period = 900 # instance-wide ondisk_dict defaults (per-index value take precedence) # optional, default is 0 (precache all dictionaries in RAM) # # ondisk_dict_default = 1 # MVA updates pool size # shared between all instances of searchd, disables attr flushes! # optional, default size is 1M mva_updates_pool = 1M # max allowed network packet size # limits both query packets from clients, and responses from agents # optional, default size is 8M max_packet_size = 8M # crash log path # searchd will (try to) log crashed query to 'crash_log_path.PID' file # optional, default is empty (do not create crash logs) # # crash_log_path = /usr/local/coreseek/var/log/crash # max allowed per-query filter count # optional, default is 256 max_filters = 256 # max allowed per-filter values count # optional, default is 4096 max_filter_values = 4096 # socket listen queue length # optional, default is 5 # # listen_backlog = 5 # per-keyword read buffer size # optional, default is 256K # # read_buffer = 256K # unhinted read size (currently used when reading hits) # optional, default is 32K # # read_unhinted = 32K } # --eof-- 接着生成索引: /usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/sphinx.conf --all 完成之后,启动服务进程searchd: /usr/local/coreseek/bin/searchd -c /usr/local/coreseek/etc/sphinx.conf 到这里,coreseek(shpinx)就已经能正常服务了,下面需要修改一下DZ X2: 登录X2后台,全局=>搜索设置=>启动sphinx作为全文检索,具体配置如下图: 一般地说,到这里就可以完全正常地使用Dz x2的shpinx全文检索功能了.但因为shpinx全文检索功能真的很强,我打算使用全文检索来取代常规搜索.所以还需要修改一个Dz x2的一个搜索程序: 修改source\module\search\search_forum.php 找到$srchtype = empty($_G['gp_srchtype']) ? '' : trim($_G['gp_srchtype']);,并在前面添加一个#号,然后新起一行,添加:$srchtype = empty($_G['gp_srchtype']) ? '' :'fulltext';这样,不管用户有没有选择全文搜索,都是使用sphinx的全文检索功能了.(在这里说一下,原来我使用$srchtype = 'fulltext';后来发现,这样会造成"查看新帖 "功能用不了,今天修复了.) 接着需要配置一下sphinx增量索引: 增量索引:build_delta_index.sh #!/bin/sh /usr/local/coreseek/bin/indexer --config /usr/local/coreseek/etc/sphinx.conf threads_minute posts_minute --rotate >> /var/log/sphinx_delta.log #/usr/local/coreseek/bin/indexer --config /usr/local/coreseek/etc/sphinx.conf --merge threads threads_minute --rotate >> /var/log/sphinx_delta.log #/usr/local/coreseek/bin/indexer --config /usr/local/coreseek/etc/sphinx.conf --merge posts posts_minute --rotate >> /var/log/sphinx_delta.log 注意后面两行的#号,原来我打算在处理增量索引的同时,执行一下索引合并的,但是,考虑一是由于现有的贴子数有230百万左右,每次合并都要花很长的时候,第二,由于sphinx在合并时,对于重复的记录并不会删除,而只是添加一个新记录,这样,索引文件的体积就会X2.现在的索引文件已经用了几十个G了,再X2那就没必要了.还好,DZ X2 会从主索引和增量索引去检索,所以我只是每天合并一次增量索引.目前看来,运行良好. 合并增量索引:merge_delta_index.sh #!/bin/sh /usr/local/coreseek/bin/indexer --config /usr/local/coreseek/etc/sphinx.conf --merge threads threads_minute --rotate >> /var/log/sphinx_delta.log /usr/local/coreseek/bin/indexer --config /usr/local/coreseek/etc/sphinx.conf --merge posts posts_minute --rotate >> /var/log/sphinx_delta.log 主索引:build_main_index.sh #!/bin/sh /usr/local/coreseek/bin/indexer --config /usr/local/coreseek/etc/sphinx.conf threads posts --rotate >> /var/log/sphinx_main.log Cron定时更新索引: crontab -e */5 * * * * /root/build_delta_index.sh > /dev/null 2 >&1 0 3 * * * 1-6 /root/merge_delta_index.sh > /dev/null 2 >&1 0 3 * * 0 /root/build_main_index.sh > /dev/null 2 >&1 保存退出. 安装Sphinx过程中遇到的一些问题和解决方法: /usr/local/coreseek/bin/search --config /usr/local/sphinx/etc/sphinx.conf /usr/local/sphinx/bin/indexer: error while loading shared libraries: libmysqlclient.so.16: cannot open shared object file: No such file or directory 这个是因为coreseek(Sphinx)找不到 libmysqlclient.so引起的,解决方法: vi /etc/ld.so.conf 在最后面添加一行: /usr/lib/mysql ldconfig ./bootstrap: line 23: aclocal: command not found ./bootstrap: line 24: libtoolize: command not found 需要安装 libtoo libtool,解决方法: yum install autoconf automake libtoo libtool WARNING: source 'xml': xmlpipe2 support NOT compiled in. To use xmlpipe2, install missing XML libraries, reconfigure, and rebuild Sphinx 需要安装libxml2,解决方法: yum install libxml2 然后重新编译coreseek(Sphinx) 如果提示不支持charset_table的话,很可能你是运行标准版的Shpinx而不是coreseek,只有coreseek才支持这个属性.运行正常的coreseek路径就可以了. Generated by Bo-blog 2.1.1 Release