Solr 4.8.0 Source Code Analysis (5): An Overview of the Query Flow
As described in earlier installments, a Solr query arrives as an HTTP request, which the Solr servlet receives and processes. The query flow therefore begins in SolrDispatchFilter.doFilter(), which handles every kind of HTTP request. Solr offers many query parameters, such as q and fq; this article only looks at /select and q.
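For concreteness, a /select query issued from the page looks something like this (host, port, and core name here are illustrative, not from the original post):

```
http://localhost:8983/solr/collection1/select?q=title:solr&fq=type:doc&start=0&rows=10
```

SolrDispatchFilter receives such a request in doFilter():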
```java
@Override
public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain) throws IOException, ServletException {
  doFilter(request, response, chain, false);
}
```
Since we only care about /select, the actual query starts from the code below; this.execute() is the entry point into the query. Note the writeResponse() call as well: execute() only produces the doc ids that match the query, and writeResponse() later uses those doc ids to fetch the stored fields and write them into the returned result.
```java
// With a valid handler and a valid core...
if( handler != null ) {
  // if not a /select, create the request
  if( solrReq == null ) {
    solrReq = parser.parse( core, path, req );
  }

  if (usingAliases) {
    processAliases(solrReq, aliases, collectionsList);
  }

  final Method reqMethod = Method.getMethod(req.getMethod());
  HttpCacheHeaderUtil.setCacheControlHeader(config, resp, reqMethod);
  // unless we have been explicitly told not to, do cache validation
  // if we fail cache validation, execute the query
  if (config.getHttpCachingConfig().isNever304() ||
      !HttpCacheHeaderUtil.doCacheHeaderValidation(solrReq, req, reqMethod, resp)) {
    SolrQueryResponse solrRsp = new SolrQueryResponse();
    /* even for HEAD requests, we need to execute the handler to
     * ensure we don't get an error (and to make sure the correct
     * QueryResponseWriter is selected and we get the correct
     * Content-Type)
     */
    SolrRequestInfo.setRequestInfo(new SolrRequestInfo(solrReq, solrRsp));
    this.execute( req, handler, solrReq, solrRsp );
    HttpCacheHeaderUtil.checkHttpCachingVeto(solrRsp, resp, reqMethod);
    // add info to http headers
    // TODO: See SOLR-232 and SOLR-267.
    // ...
```
From there we land in SolrCore.execute(). preDecorateResponse() pre-processes the response, e.g. its header information; postDecorateResponse() writes the elapsed time and the results into the response; and handleRequest() carries the query forward.
```java
public void execute(SolrRequestHandler handler, SolrQueryRequest req, SolrQueryResponse rsp) {
  if (handler==null) {
    String msg = "Null Request Handler '" +
                 req.getParams().get(CommonParams.QT) + "'";

    if (log.isWarnEnabled()) log.warn(logid + msg + ":" + req);

    throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, msg);
  }

  preDecorateResponse(req, rsp);

  // TODO: this doesn't seem to be working correctly and causes problems with the example server and distrib (for example /spell)
  // if (req.getParams().getBool(ShardParams.IS_SHARD,false) && !(handler instanceof SearchHandler))
  //   throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,"isShard is only acceptable with search handlers");

  handler.handleRequest(req,rsp);
  postDecorateResponse(handler, req, rsp);

  if (log.isInfoEnabled() && rsp.getToLog().size() > 0) {
    log.info(rsp.getToLogAsString(logid));
  }
}
```
RequestHandlerBase.handleRequest(SolrQueryRequest req, SolrQueryResponse rsp) in turn calls SearchHandler.handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp), and only at this point are the query components actually brought into play.
The following statement runs all of the search-related components, including QueryComponent, FacetComponent, MoreLikeThisComponent, HighlightComponent, StatsComponent, DebugComponent, and ExpandComponent. Since this article is only concerned with the query itself, we step into QueryComponent.java.
```java
for( SearchComponent c : components ) {
  c.process(rb);
}
```
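As an aside, the set of components a SearchHandler runs is configurable. Below is a minimal sketch of what that can look like in solrconfig.xml; the names listed are the standard component names, but treat the exact layout as an assumption rather than the stock 4.8.0 configuration:

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <!-- explicitly listing components replaces the default set -->
  <arr name="components">
    <str>query</str>
    <str>facet</str>
    <str>mlt</str>
    <str>highlight</str>
    <str>stats</str>
    <str>expand</str>
    <str>debug</str>
  </arr>
</requestHandler>
```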
Leaving aside the details of query handling inside QueryComponent.java (those are covered in later chapters; this one is only an overview), QueryComponent.process(ResponseBuilder rb) calls SolrIndexSearcher.search(QueryResult qr, QueryCommand cmd) to run the query and then post-processes the returned results, mainly through doFieldSortValues(rb, searcher) and doPrefetch(rb):
```java
// normal search result
searcher.search(result,cmd);
rb.setResult( result );

ResultContext ctx = new ResultContext();
ctx.docs = rb.getResults().docList;
ctx.query = rb.getQuery();
rsp.add("response", ctx);
rsp.getToLog().add("hits", rb.getResults().docList.matches());

if ( ! rb.req.getParams().getBool(ShardParams.IS_SHARD,false) ) {
  if (null != rb.getNextCursorMark()) {
    rb.rsp.add(CursorMarkParams.CURSOR_MARK_NEXT,
               rb.getNextCursorMark().getSerializedTotem());
  }
}
doFieldSortValues(rb, searcher);
doPrefetch(rb);
```
SolrIndexSearcher.search() itself is simple: it just calls SolrIndexSearcher.getDocListC(). As the name suggests, that function returns the list of matching doc ids, and this is where the real query begins. Before searching, Solr first checks the queryResultCache, which stores key-value pairs mapping a query to its result. On a cache hit, Solr returns the cached result directly; on a miss, it runs the query normally and then stores the query key and its result in queryResultCache. The cache has a bounded capacity, which can be configured in the cache section of solrconfig.xml.
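For reference, a minimal sketch of how this cache is declared in solrconfig.xml (the sizes are illustrative values, not recommendations):

```xml
<queryResultCache class="solr.LRUCache"
                  size="512"
                  initialSize="512"
                  autowarmCount="0"/>
```

The cache lookup inside getDocListC() looks like this: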
```java
// we can try and look up the complete query in the cache.
// we can't do that if filter!=null though (we don't want to
// do hashCode() and equals() for a big DocSet).
if (queryResultCache != null && cmd.getFilter()==null
    && (flags & (NO_CHECK_QCACHE|NO_SET_QCACHE)) != ((NO_CHECK_QCACHE|NO_SET_QCACHE)))
{
  // all of the current flags can be reused during warming,
  // so set all of them on the cache key.
  key = new QueryResultKey(q, cmd.getFilterList(), cmd.getSort(), flags);
  if ((flags & NO_CHECK_QCACHE)==0) {
    superset = queryResultCache.get(key);

    if (superset != null) {
      // check that the cache entry has scores recorded if we need them
      if ((flags & GET_SCORES)==0 || superset.hasScores()) {
        // NOTE: subset() returns null if the DocList has fewer docs than
        // requested
        out.docList = superset.subset(cmd.getOffset(),cmd.getLen());
      }
    }
    if (out.docList != null) {
      // found the docList in the cache... now check if we need the docset too.
      // OPT: possible future optimization - if the doclist contains all the matches,
      // use it to make the docset instead of rerunning the query.
      if (out.docSet==null && ((flags & GET_DOCSET)!=0) ) {
        if (cmd.getFilterList()==null) {
          out.docSet = getDocSet(cmd.getQuery());
        } else {
          List<Query> newList = new ArrayList<>(cmd.getFilterList().size()+1);
          newList.add(cmd.getQuery());
          newList.addAll(cmd.getFilterList());
          out.docSet = getDocSet(newList);
        }
      }
      return;
    }
  }

  // If we are going to generate the result, bump up to the
  // next resultWindowSize for better caching.

  if ((flags & NO_SET_QCACHE) == 0) {
    // handle 0 special case as well as avoid idiv in the common case.
    if (maxDocRequested < queryResultWindowSize) {
      supersetMaxDoc=queryResultWindowSize;
    } else {
      supersetMaxDoc = ((maxDocRequested -1)/queryResultWindowSize + 1)*queryResultWindowSize;
      if (supersetMaxDoc < 0) supersetMaxDoc=maxDocRequested;
    }
  } else {
    key = null;  // we won't be caching the result
  }
}
```
If there is no matching cache entry, a normal search is performed. Here the query takes either the sorting or the non-sorting branch (the difference between the two will be discussed in a later article), and finally enters getDocListNC(qr, cmd). superset.subset() then trims the result: for example, for a query with start=20 and rows=40, Solr actually searches with start=0 and rows=60; in other words it collects at least (start + rows) results and then slices out results 20 through 60, as illustrated after the code below.
```java
if (useFilterCache) {
  // now actually use the filter cache.
  // for large filters that match few documents, this may be
  // slower than simply re-executing the query.
  if (out.docSet == null) {
    out.docSet = getDocSet(cmd.getQuery(),cmd.getFilter());
    DocSet bigFilt = getDocSet(cmd.getFilterList());
    if (bigFilt != null) out.docSet = out.docSet.intersection(bigFilt);
  }
  // todo: there could be a sortDocSet that could take a list of
  // the filters instead of anding them first...
  // perhaps there should be a multi-docset-iterator
  sortDocSet(qr, cmd);
} else {
  // do it the normal way...
  if ((flags & GET_DOCSET)!=0) {
    // this currently conflates returning the docset for the base query vs
    // the base query and all filters.
    DocSet qDocSet = getDocListAndSetNC(qr,cmd);
    // cache the docSet matching the query w/o filtering
    if (qDocSet!=null && filterCache!=null && !qr.isPartialResults()) filterCache.put(cmd.getQuery(),qDocSet);
  } else {
    getDocListNC(qr,cmd);
  }
  assert null != out.docList : "docList is null";
}

if (null == cmd.getCursorMark()) {
  // Kludge...
  // we can't use DocSlice.subset, even though it should be an identity op
  // because it gets confused by situations where there are lots of matches, but
  // less docs in the slice then were requested, (due to the cursor)
  // so we have to short circuit the call.
  // None of which is really a problem since we can't use caching with
  // cursors anyway, but it still looks weird to have to special case this
  // behavior based on this condition - hence the long explanation.
  superset = out.docList;
  out.docList = superset.subset(cmd.getOffset(),cmd.getLen());
} else {
  // sanity check our cursor assumptions
  assert null == superset : "cursor: superset isn't null";
  assert 0 == cmd.getOffset() : "cursor: command offset mismatch";
  assert 0 == out.docList.offset() : "cursor: docList offset mismatch";
  assert cmd.getLen() >= supersetMaxDoc : "cursor: superset len mismatch: " + cmd.getLen() + " vs " + supersetMaxDoc;
}
```
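To make the windowing concrete, the following toy fragment mirrors the supersetMaxDoc rounding from the cached-superset code above, using the start=20/rows=40 example (queryResultWindowSize=20 is an assumed configuration value):

```java
// Illustrative fragment only: mirrors the rounding in getDocListC().
int start = 20, rows = 40;
int queryResultWindowSize = 20;                  // assumed value from solrconfig.xml
int maxDocRequested = start + rows;              // 60
int supersetMaxDoc = ((maxDocRequested - 1) / queryResultWindowSize + 1)
                     * queryResultWindowSize;    // (59/20 + 1) * 20 = 60
// Solr collects and caches docs [0, 60), then returns the requested
// slice via superset.subset(20, 40).
```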
SolrIndexSearcher.getDocListNC(qr, cmd) defines a number of Collector inner classes, but they are not relevant to this chapter, so we go straight to the code below. Solr first builds a TopDocsCollector, which holds every document matching the query. If the request sets the timeAllowed switch, the search goes through the TimeLimitingCollector branch. TimeLimitingCollector is a Collector subclass: with timeAllowed set to, say, 200 ms, the query returns within 200 ms of starting to collect results, whether or not the result set is complete. Notice that the search ultimately calls Lucene's IndexSearcher.search(); from this layer on we are inside Lucene. Finally, Solr reads the total hit count and the priority queue of top documents back out of the TopDocsCollector.
```java
final TopDocsCollector topCollector = buildTopDocsCollector(len, cmd);
Collector collector = topCollector;
if (terminateEarly) {
  collector = new EarlyTerminatingCollector(collector, cmd.len);
}
if( timeAllowed > 0 ) {
  collector = new TimeLimitingCollector(collector, TimeLimitingCollector.getGlobalCounter(), timeAllowed);
}
if (pf.postFilter != null) {
  pf.postFilter.setLastDelegate(collector);
  collector = pf.postFilter;
}
try {
  super.search(query, luceneFilter, collector);
  if(collector instanceof DelegatingCollector) {
    ((DelegatingCollector)collector).finish();
  }
}
catch( TimeLimitingCollector.TimeExceededException x ) {
  log.warn( "Query: " + query + "; " + x.getMessage() );
  qr.setPartialResults(true);
}

totalHits = topCollector.getTotalHits();
TopDocs topDocs = topCollector.topDocs(0, len);
populateNextCursorMarkFromTopDocs(qr, cmd, topDocs);

maxScore = totalHits>0 ? topDocs.getMaxScore() : 0.0f;
nDocsReturned = topDocs.scoreDocs.length;
ids = new int[nDocsReturned];
scores = (cmd.getFlags()&GET_SCORES)!=0 ? new float[nDocsReturned] : null;
for (int i=0; i<nDocsReturned; i++) {
  ScoreDoc scoreDoc = topDocs.scoreDocs[i];
  ids[i] = scoreDoc.doc;
  if (scores != null) scores[i] = scoreDoc.score;
}
```
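From the client side, the time limit is simply the timeAllowed request parameter; a hypothetical request (host and core name are illustrative):

```
http://localhost:8983/solr/collection1/select?q=title:solr&timeAllowed=200
```

When the limit is exceeded, the qr.setPartialResults(true) call above is what ultimately surfaces as partialResults=true in the response header.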
Inside Lucene's IndexSearcher.search(), the engine iterates over all segments; each AtomicReaderContext carries a segment's information, including its docBase and document count.
For each segment, Weight.bulkScorer() reorganizes the query clauses, for example combining multiple OR clauses into one and gathering multiple AND clauses into a list, and sorts that list by the clauses' term frequencies. Once the clauses are prepared, scorer.score(collector) is what collects every matching doc id from the segment (how those doc ids are actually retrieved will be detailed in a later article).
```java
/**
 * Lower-level search API.
 *
 * {@link Collector#collect(int)} is called for every document.
 *
 * NOTE: this method executes the searches on all given leaves exclusively.
 * To search across all the searchers leaves use {@link #leafContexts}.
 *
 * @param leaves
 *          the searchers leaves to execute the searches on
 * @param weight
 *          to match documents
 * @param collector
 *          to receive hits
 * @throws BooleanQuery.TooManyClauses If a query would exceed
 *         {@link BooleanQuery#getMaxClauseCount()} clauses.
 */
protected void search(List<AtomicReaderContext> leaves, Weight weight, Collector collector)
    throws IOException {

  // TODO: should we make this
  // threaded...?  the Collector could be sync'd?
  // always use single thread:
  for (AtomicReaderContext ctx : leaves) { // search each subreader
    try {
      collector.setNextReader(ctx);
    } catch (CollectionTerminatedException e) {
      // there is no doc of interest in this reader context
      // continue with the following leaf
      continue;
    }
    BulkScorer scorer = weight.bulkScorer(ctx, !collector.acceptsDocsOutOfOrder(), ctx.reader().getLiveDocs());
    if (scorer != null) {
      try {
        scorer.score(collector);
      } catch (CollectionTerminatedException e) {
        // collection was terminated prematurely
        // continue with the following leaf
      }
    }
  }
}
```
At this point all the matching doc ids have been collected, but the query result needs to display all of the fields, so Solr will later go back to the segments and fetch every stored field by doc id. Where exactly that happens will be described in detail in a later article.
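As a preview, at the Lucene level that second lookup boils down to something like the sketch below. This is not the exact Solr code path (Solr goes through SolrIndexSearcher.doc(), which consults its documentCache first), and the field name is hypothetical:

```java
// Illustrative only: resolve a collected doc id back to its stored fields.
Document doc = searcher.doc(docId);   // IndexSearcher.doc() reads stored fields from the segment
String title = doc.get("title");      // "title" is a hypothetical stored field
```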
Summary: Solr's query flow is fairly convoluted, and there are plenty of places where it could be optimized. This article has only sketched the overall flow; the details of the query process will be covered in subsequent articles.