Sunspot 笔记

Posted on May 27, 2014

安装

  1. 添加GEM

    gem 'sunspot_rails', '~> 1.3.0'
    
    group :development do
      gem 'sunspot_solr' # optional pre-packaged Solr distribution for use in development
      gem 'progress_bar' # 重建索引时输出进度
    end
    
  2. rails generate sunspot_rails:install 生成配置文件config/sunspot.yml

    development:
      solr:
        hostname: localhost
        port: 8982
        log_level: INFO
        path: /solr/development #需要手动添加,否则启动不了
    

启动

rake sunspot:solr:start 启动solr服务, 第一次会创建solr/

http://localhost:8982/solr/admin/ 查看修改配置

当数据库数据发生改变时,solr会自动重建索引,但是如果修改了AD的searchable配置,需要手动重建索引

rake sunspot:reindex

Setting up classes for search and indexing

  1. 填充字段方法

    当sunspot索引ruby对象时, 会基于AD对象提取数据,然后创建Solr文档(Solr document), 有2种提取字段数据的方法:

    • attribute extraction

      即通过调用AD对象的方法, 使用返回值作为索引的内容

      boolean :featured, :using => :featured? featured 只是一个标示, 如果没有设置using, 将在AD上调用featured, 否则调用 using设置的AD方法名

    • block extraction

    text :author_names do |post|                      #block参数代表每个AD对象
      post.authors.map { |author| author.full_name }
    end
    
  2. 文字字段

    • text

      唯一支持全文检索的类型

      When text fields are indexed, they are broken up into their constituent words and then processed using a definable set of filters (with Sunspot’s default Solr installation, they’re just lower-cased). This process is known as tokenization, and it’s what allow text fields to be searched using fulltext matching

    • boost

      当文字字段被索引时, 每个文档都有一个相关度score, based on where the searched words appear in the document, how many times they appear in the document, and how common they are in the index as a whole

      可以指定boost: 如在searchable里boost { featured? ? 2.0 : 1.0 } 或者 boost 1.2

  3. 属性字段

    属性字段主要用于 scoping, faceting, ordering等, 不想text字段, 属性字段不会进行分词, 索引后会逐字扫描

    可用类型:

    • string: 不会进行分词

    • integer:

      Deal.search { keywords('碎花半身裙'); with(:tag_id, 2) }

    • float:

    • time: 存储date/time类型, 等同于ruby中Time类型

    • date: 等同于ruby中Date类型,但是不是Solr内置类型

    • boolean:

    • trie:

    附加属性:

    • multiple: bool型,表明方法/block是否返回多值(Array), 这种类型字段不能用于排序

    • references: Class型, 表明这个字段返回值是指定class的主键

    • stored: bool型, 如果是true,表明在建立索引时,该字段值将被存入hit对象,此时获取该值就不用再次查询数据库

      text :title, default_boost: 2, stored: true
      search.hits.first.stored(:title)  #不用查询数据库
      
    • trie: 只对数字和时间字段有用


  1. 主要类型

    search = Deal.search { keywords('碎花半身裙') } # Sunspot::Search::StandardSearch 不会查询数据库
    
    search.hits                                    # 返回Sunspot::Search::Hit的一个数组,不会查库
    search.hits.first.result                       # 返回对应的AD对象, 会把所有hits的AD对象都查出来
    
    ss.total                                       # 结果的数量
    
    search.each_hit_with_result do |hit, deal|      # 会生成sql查库:Deal Load (0.2ms)  SELECT `deals`.* FROM `deals` WHERE `deals`.`id` IN (27795, 32717, 38406, 229341, 41062, 7841, 25940, 26906, 49364, 43280, 88175, 350190)
      puts hit.class                                # Sunspot::Search::Hit
      puts hit.score
      puts deal.title
    end
    
    search.results                                 # AD 对象数组, 查库
    
  2. 一次搜索多个: search = Sunspot.search(Post, Comment)

  3. 分页

    Deal.search do
      keywords('碎花半身裙')
      paginate(page: 2, per_page: 3)
    end
    
    <%= will_paginate(@search.hits) %>
    
  4. Keyword highlighting

    TODO

  5. 假定不一致和验证hits

    Solr不是为实时查询而生, 所以会有数据短期不一致的情况

    Sunspot 奉行”假定不一致”原则: it doesn’t break if the Solr results reference an object that doesn’t actually exist in the database

    results each_hit_with_result 在查询数据库后,会扔掉查找不到的数据

    hits 的目的是不查询数据,所以该方法不会验证数据存在否,如果需要强制验证,可以hits(:verify => true)

  6. Working with facets

    TODO


Fulltext search in Sunspot uses Solr’s dismax handler. The dismax handler is designed for parsing user-entered search phrases, and provides a good balance between functionality and error-proofing. A small subset of the normal boolean query syntax is parsed: in particular, well-matched quotation marks can be used to demarcate phrases, and the + and - operators can be used to require or exclude words respectively. All other Solr boolean syntax is escaped, and non-well-matched quotation marks are ignored. This is fairly consistent with the functionality users are accustomed to with search on the web and elsewhere

Sunspot.search(Post) do
  keywords 'great pizza' # 这是2个搜索关键词, 会搜索所有text定义字段, 如果在一个search中多次调用keywords,后者覆盖前者
end

keywords 'great pizza', :fields => [:title, :body]  #指定搜索字段
keywords 'great pizza', :highlight => true          #结果高亮, 声明需要stored => true 才能起作用

keywords 和 fulltext貌似是别名关系

Post.search do
  fulltext 'pizza' do
     boost_fields :title => 2.0             #提高匹配到某一字段的score
  end
end

Post.search do
  fulltext 'pizza' do
    boost(2.0) { with(:featured, true) }    #这个with不是search限制条件,而是用于提高score的条件
  end
end

Post.search do
  fulltext 'pizza' do
    fields(:title)               #指定搜索字段
  end
end


Post.search do
  fulltext 'pizza' do
    fields(:body, :title => 2.0) #指定搜索字段,并且匹配title的更高
  end
end

Phrases

Phrases 指的是靠在一起的词(search terms that are close together) (含有空格的搜索词)

双引号中的是phrase: fulltext '"great pizza"'

query_phrase_slop 意思是词之间可以包含的其他词, 如

Post.search do
  fulltext '"great pizza"' do  # 匹配"great big pizza" 和 "great pizza"
    query_phrase_slop 1
  end
end

Phrase Boosts: TODO


Scoping by attribute fields

  1. 范围查询

    类似sql条件, 以with开始, 参数需要是searchable中声明的字段, 且只能是属性字段, 不能是text字段

    • equal_to
    • less_than
    • greater_than
    • less_than_or_equal_to
    • greater_than_or_equal_to
    • between
    • any_of
    • all_of 可用于组合与或关系
     with(:published_at).less_than(Time.now)
     without(:category_ids, 2)               #否定o
    
     post = Post.find(params[:id])
     Sunspot.search(Post) do
       without(post)                         #排除指定对象
     end
    
     Sunspot.search(Post) do                 #多条件与
       with(:published_at).less_than(Time.now)
       with(:blog_id, 1)
     end
    
     Sunspot.search(Post) do                  # 或关系
       any_of do
         with(:expired_at).greater_than(Time.now)
         with(:expired_at, nil)
       end
     end
    

    简写:

     with(:blog_id, 1)
     with(:blog_id).equal_to(1)
    
     with(:average_rating, 3.0..5.0)
     with(:average_rating).between(3.0..5.0)
    
     with(:category_ids, [1, 3, 5])
     with(:category_ids).any_of([1, 3, 5])
    
  2. 空值

    空值只能传递给equal_to with(:expired_at, nil)

    空值 with(:category_ids, id_list) if id_list.present? 等于什么也没做,所以等价于 with(:category_ids, id_list)

Note that, other than passing nil into an equality restriction, no restriction will ever match a field that has no value. For instance, neither with(:published_at).less_than(Time.now) nor with(:published_at).greater_than(Time.now) will match documents which have no published_at set at all. This will come as a surprise to those who are used to SQL queries but is a natural fact of the way Solr’s range queries work


Ordering and pagination

  1. 排序

    Sunspot运行order非text字段多次, 一次search里可以调用多次order_by, 越早调用的优先级越高, 如果没有order_by, 将会按照score排序

     order_by(:blog_id, :asc)
     order_by(:created_at, :desc)
    

    如果要用text排序,必须另外声明一个string类型

    有2个特殊的排序类型: :score :random

  2. 分页

    所有search都会被分页, per_page默认30, 显示设定: paginate(:page => 2, :per_page => 15)

    如需修改: Sunspot.config.pagination.default_per_page = 30 (在哪里修改??)


Drilling down with facets

类似sql中的group by + count(*)

search = Deal.search do
  keywords('碎花半身裙')
  facet(:tag_id)
end

search.facets                           #Sunspot::Search::FieldFacet 数组
search.facet(:tag_id)                   #Sunspot::Search::FieldFacet 对象
search.facet(:tag_id).rows              #Sunspot::Search::FacetRow 数组
search.facet(:tag_id).rows.each do |r| 
  r.value                               #分组内容(tag_id)
  r.count                               #分组数量
  r.instance                            #如果声明facet指定了references: SomeClass, row才会有这个方法,返回AD对象(引起查库)
end

需要注意facet不受分页限制,返回的满足查询条件的所有数据统计

  • Date Facets

    TODO


Dynamic fields

TODO


Advanced Fulltext Search Configuration

Sunspot全文检索默认策略:Text is divided into tokens based on whitespace and other delimiter characters using a smart tokenizer called the StandardTokenizer

使用LowerCaseFilter以便全文检索大小写无关

配置文件solr/conf/schema.xml, 常用高级配置:

支持同义词: 在index-time <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />

Porter stem: both index-time and query-time: 采用Porter Stemming Algorithm算法去掉单词的后缀,例如将复数形式变成单数形式,第三人称动词变成第一人称,现在分词变成一般现在时的动词 <filter class="solr.PorterStemFilterFactory"/>


参考资料