joeyta's Notes: Apache Lucene (Search Engine) Notes

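Apache Lucene is an efficient, open-source search engine library written in Java with support for full-text retrieval.
Its main job is to index data and search it.
Combined with other tools, it can provide full-text search over Word, HTML, PDF, and Excel documents.
The name Lucene comes from the middle name of the wife of its creator, Doug Cutting, which was also her maternal grandmother's first name.
Lucene has been used successfully in a variety of products, such as Eclipse, Jive, and Ifinder.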
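
To make "index data and search it" concrete, here is a minimal sketch of an inverted index, the kind of word-to-document lookup table a full-text engine is built around. It is a toy written with the JDK only; the class name ToyInvertedIndex and the sample sentences are made up for illustration, and it says nothing about how Lucene actually stores its index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy inverted index: maps each lower-cased word to the ids of the documents containing it.
// Illustration only; this is not Lucene's implementation.
class ToyInvertedIndex {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // Index one document: split on anything that is not a letter or digit.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty first token
            }
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    // Look up a single word; returns matching document ids in ascending order.
    List<Integer> search(String word) {
        TreeSet<Integer> hits = postings.get(word.toLowerCase(Locale.ROOT));
        return hits == null ? Collections.<Integer>emptyList() : new ArrayList<>(hits);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(1, "Apache Lucene is a search engine library");
        index.add(2, "Tomcat is a servlet container");
        index.add(3, "Luke is a viewer for Lucene indexes");
        System.out.println(index.search("lucene")); // prints [1, 3]
        System.out.println(index.search("tomcat")); // prints [2]
    }
}
```

Indexing is done once up front so that each search is a map lookup rather than a scan of every document, which is the same trade the steps below make by building D:\lucene\index before searching.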

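Notes start here:
First, install Tomcat 5.x.
http://apache.seekmeup.com/tomcat/tomcat-5/v5.5.17/bin/apache-tomcat-5.5.17.exe
After downloading, install it directly into D:\tomcat.
Because several Tomcat instances are installed on my machine, I set the port to 8083 (the default port is 8080).
Open http://localhost:8083/ to check that the installation succeeded.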
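
The same check can be scripted instead of done in a browser. The sketch below uses only the JDK's HttpURLConnection; the class name TomcatCheck and the 2-second timeouts are arbitrary choices for illustration, and treating HTTP 200 as "installed correctly" is an assumption about the default Tomcat welcome page.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

class TomcatCheck {
    // Build the base URL for a local Tomcat listening on the given port.
    static String baseUrl(int port) {
        return "http://localhost:" + port + "/";
    }

    // Returns true if an HTTP GET to the URL answers with status 200.
    static boolean isUp(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(2000); // milliseconds, arbitrary
            conn.setReadTimeout(2000);
            conn.setRequestMethod("GET");
            int status = conn.getResponseCode();
            conn.disconnect();
            return status == HttpURLConnection.HTTP_OK;
        } catch (IOException e) {
            return false; // connection refused, timeout, malformed URL, ...
        }
    }

    public static void main(String[] args) {
        String url = baseUrl(8083); // 8083 is the port chosen in these notes
        System.out.println(url + (isUp(url) ? " is responding" : " is not responding"));
    }
}
```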

Download lucene-2.0.0.zip:
http://www.apache.org/dyn/closer.cgi/lucene/java/

After unpacking it, copy luceneweb.war into D:\tomcat\webapps.
When Tomcat starts, it automatically creates D:\tomcat\webapps\luceneweb.

Building the index:
Create a directory to hold the index: D:\lucene\index
Copy the docs directory from lucene-2.0.0.zip into D:\tomcat\webapps\luceneweb (to serve as test data).
Add lucene-core-2.0.0.jar and lucene-demos-2.0.0.jar to the CLASSPATH environment variable.
Open a DOS command prompt and run:
java org.apache.lucene.demo.IndexHTML -create -index "D:\lucene\index" "D:\tomcat\webapps\luceneweb"
D:\lucene\index is the directory where the index is stored.
D:\tomcat\webapps\luceneweb is the root directory of the web content to be indexed and searched.
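
If changing the global CLASSPATH is undesirable, the same indexer can be launched with the two jars passed through -cp. The sketch below builds that command line with ProcessBuilder. The jar locations under D:\lucene-2.0.0 are hypothetical (these notes do not say where the zip was unpacked), and the class name IndexHtmlCommand is made up; the main class and its -create and -index options are the ones used in the command above.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

class IndexHtmlCommand {
    // Build the argument list for the demo indexer, passing the jars via -cp
    // instead of the CLASSPATH environment variable.
    static List<String> build(String coreJar, String demosJar, String indexDir, String rootDir) {
        String classpath = coreJar + File.pathSeparator + demosJar; // ";" on Windows
        return Arrays.asList(
                "java", "-cp", classpath,
                "org.apache.lucene.demo.IndexHTML",
                "-create", "-index", indexDir, rootDir);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical jar locations; adjust to wherever lucene-2.0.0.zip was unpacked.
        List<String> cmd = build(
                "D:\\lucene-2.0.0\\lucene-core-2.0.0.jar",
                "D:\\lucene-2.0.0\\lucene-demos-2.0.0.jar",
                "D:\\lucene\\index",
                "D:\\tomcat\\webapps\\luceneweb");
        // inheritIO() lets the indexer's "adding ..." lines show up on this console.
        Process process = new ProcessBuilder(cmd).inheritIO().start();
        System.out.println("exit code: " + process.waitFor());
    }
}
```

Passing each argument as its own list element also avoids having to quote the Windows paths by hand.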

Output:
adding D:/tomcat/webapps/luceneweb/docs/mailinglists.html
adding D:/tomcat/webapps/luceneweb/docs/queryparsersyntax.html
.....
.....
adding D:/tomcat/webapps/luceneweb/docs/resources.html
adding D:/tomcat/webapps/luceneweb/docs/systemproperties.html
adding D:/tomcat/webapps/luceneweb/docs/whoweare.html
Optimizing index...
53389 total milliseconds
This indicates that the index was built successfully. [Screenshot: files generated by indexing]

Edit D:\tomcat\webapps\luceneweb\configuration.jsp
Change String indexLocation = "/opt/lucene/index";
to:
String indexLocation = "D:\\lucene\\index";
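
The doubled backslashes are Java string-literal escaping, not part of the path. A small sketch that shows what the literal really contains and checks that the directory exists; the class name IndexLocationDemo is made up for illustration.

```java
import java.io.File;

class IndexLocationDemo {
    // In Java source, each backslash inside a string literal is written as "\\".
    static final String INDEX_LOCATION = "D:\\lucene\\index";

    public static void main(String[] args) {
        System.out.println(INDEX_LOCATION);          // prints D:\lucene\index
        System.out.println(INDEX_LOCATION.length()); // prints 15
        // Check that the directory is really there before pointing the webapp at it.
        File dir = new File(INDEX_LOCATION);
        System.out.println(dir.isDirectory() ? "index directory found" : "index directory missing");
    }
}
```

On Windows, java.io.File also accepts forward slashes, so "D:/lucene/index" should work as well and avoids the escaping; that variant was not tried in these notes.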

Browse to http://localhost:8083/luceneweb/ and enter a Search Criteria.
Clicking Search then produces the following error:
parse(java.lang.String) in org.apache.lucene.queryParser.QueryParser cannot be applied to ...

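This is a bug in the luceneweb web project: the JSP calls a static three-argument parse method, while QueryParser in Lucene 2.0 only offers the instance method parse(String).
Edit line 81 of D:\tomcat\webapps\luceneweb\results.jsp
Change query = QueryParser.parse(queryString, "contents", analyzer); //parse the
to:
QueryParser qp = new QueryParser("contents", analyzer);
query = qp.parse(queryString);

Line 129 of D:\tomcat\webapps\luceneweb\results.jsp also needs to be changed.
Change String url = doc.get("url"); //get its url field
to String url = doc.get("path"); //get its url field
This is because indexing with IndexHTML does not produce a url field; it produces a path field instead.
luceneweb has quite a few problems, so it is best to write your own if you can.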
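
For anyone writing their own results page, the second fix can be made a little more forgiving by trying "path" first and falling back to "url". The sketch below only illustrates that lookup order: a plain Map stands in for Lucene's Document so the example runs without the Lucene jar, and the class name LinkField and the sample values are made up. Like Document.get, Map.get returns null when the field is absent.

```java
import java.util.HashMap;
import java.util.Map;

class LinkField {
    // Prefer "path" (the field IndexHTML writes, per these notes),
    // fall back to "url", and return null if neither field is stored.
    static String linkOf(Map<String, String> fields) {
        String link = fields.get("path");
        return link != null ? link : fields.get("url");
    }

    public static void main(String[] args) {
        // Stand-in for a search hit; the value is a made-up sample.
        Map<String, String> hit = new HashMap<>();
        hit.put("path", "D:/tomcat/webapps/luceneweb/docs/whoweare.html");
        System.out.println(linkOf(hit)); // prints D:/tomcat/webapps/luceneweb/docs/whoweare.html
        System.out.println(linkOf(new HashMap<String, String>())); // prints null
    }
}
```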

Browse to http://localhost:8083/luceneweb/ again. The page looks like this: [screenshot]

Enter "lucene" in Search Criteria and click Search. The following page appears: [screenshot]

The official tutorials are listed below; there are pitifully few:
http://lucene.apache.org/java/docs/gettingstarted.html

Questions can be asked on the mailing list:
http://www.gossamer-threads.com/lists/lucene/java-user/

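Browsing the index with Luke:
Luke is a viewer for Lucene indexes, a desktop application written in Java.
Click the link below to launch the Java Web Start version directly (a binary can also be downloaded from the same site and run locally):
http://www.getopt.org/luke/webstart.html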

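Once it opens, it looks like this: [screenshot]

Choose File -> Open Lucene Index
Browse to the index directory we just created, D:\lucene\index, and click OK. The following screen appears: [screenshot]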

The interface is very easy to use; see the official Luke documentation:
http://www.getopt.org/luke/

To use Lucene together with weblucene, see the official weblucene tutorial:
http://www.chedong.com/tech/weblucene.html

1 comment:

Anonymous said...: (quoting the two luceneweb fixes from the post above) Both of these bugs have already been fixed in version 2.2...^^; 11:49 AM

Sunday, July 09, 2006
