Blog info: Xing Hongrui's blog | Posts: 523 | Comments: 1142 | Guestbook messages: 0 | Visits: 9702041 | Created: December 20, 2004

[Java] Solving full-text retrieval of Word documents with Lucene (my calling is search) | Original, Software Technology
邢红瑞 (Xing Hongrui), posted 2005/11/20 13:08:37
Lucene is a person's name: "Lucene is Doug's wife's middle name; it's also her maternal grandmother's first name."
I read a post on Che Dong's blog about parsers for MS Word documents. Unlike RTF, which is ASCII-based, the binary Word format is normally parsed through the COM object mechanism. In fact, Apache POI is perfectly capable of parsing MS Word documents. I modified someone else's example; treat it as a starting point, and please don't throw bricks at me.
Lucene does not prescribe any data-source format; it only provides a generic structure (the Document object) as the input to indexing, and that input seems to be limited to plain text.
package org.tatan.framework;

import java.io.PrintStream;
import java.io.PrintWriter;

public class DocumentHandlerException extends Exception {

    private Throwable cause;

    /** Default constructor. */
    public DocumentHandlerException() {
        super();
    }

    /** Constructs with message. */
    public DocumentHandlerException(String message) {
        super(message);
    }

    /** Constructs with chained exception. */
    public DocumentHandlerException(Throwable cause) {
        super(cause.toString());
        this.cause = cause;
    }

    /** Constructs with message and chained exception. */
    public DocumentHandlerException(String message, Throwable cause) {
        super(message);
        this.cause = cause;   // keep the cause so getException() can return it
    }

    /** Retrieves the nested exception. */
    public Throwable getException() {
        return cause;
    }

    public void printStackTrace() {
        printStackTrace(System.err);
    }

    public void printStackTrace(PrintStream ps) {
        synchronized (ps) {
            super.printStackTrace(ps);
            if (cause != null) {
                ps.println("--- Nested Exception ---");
                cause.printStackTrace(ps);
            }
        }
    }

    public void printStackTrace(PrintWriter pw) {
        synchronized (pw) {
            super.printStackTrace(pw);
            if (cause != null) {
                pw.println("--- Nested Exception ---");
                cause.printStackTrace(pw);
            }
        }
    }
}
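The nested-cause pattern above (store the cause, append it to the stack trace, expose it via a getter) can be exercised with a small, self-contained sketch. The class and message names here are illustrative, not from the post:

```java
import java.io.PrintWriter;
import java.io.StringWriter;

public class NestedExceptionDemo {
    // Minimal copy of the pattern: keep the cause and print it after our own trace.
    static class HandlerException extends Exception {
        private final Throwable cause;
        HandlerException(String message, Throwable cause) {
            super(message);
            this.cause = cause;
        }
        Throwable getException() { return cause; }
        @Override
        public void printStackTrace(PrintWriter pw) {
            synchronized (pw) {
                super.printStackTrace(pw);
                if (cause != null) {
                    pw.println("--- Nested Exception ---");
                    cause.printStackTrace(pw);
                }
            }
        }
    }

    public static void main(String[] args) {
        try {
            try {
                throw new IllegalStateException("broken .doc stream");
            } catch (Exception e) {
                // Wrap the low-level failure, preserving it as the nested cause.
                throw new HandlerException("Cannot extract text", e);
            }
        } catch (HandlerException he) {
            StringWriter sw = new StringWriter();
            he.printStackTrace(new PrintWriter(sw, true));
            System.out.println(sw.toString().contains("--- Nested Exception ---")); // true
            System.out.println(he.getException().getMessage()); // broken .doc stream
        }
    }
}
```

On JDK 1.4 and later, `Throwable` already supports cause chaining natively, so this hand-rolled field mainly matters for code that must run on older JVMs, as was common when the post was written.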
The class that parses MS Word documents:
package org.tatan.framework;

import org.apache.poi.hdf.extractor.WordDocument;

import java.io.InputStream;
import java.io.PrintWriter;
import java.io.StringWriter;

public class POIWordDocHandler {

    public String getDocument(InputStream is) throws DocumentHandlerException {
        String bodyText = null;
        try {
            WordDocument wd = new WordDocument(is);
            StringWriter docTextWriter = new StringWriter();
            wd.writeAllText(new PrintWriter(docTextWriter));
            docTextWriter.close();
            bodyText = docTextWriter.toString();
        }
        catch (Exception e) {
            throw new DocumentHandlerException(
                "Cannot extract text from a Word document", e);
        }
        if ((bodyText != null) && (bodyText.trim().length() > 0)) {
            return bodyText;
        }
        return null;
    }
}
The indexing class:
package org.tatan.framework;

import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Date;

public class Indexer {

    public static void main(String[] args) throws Exception {
        File indexDir = new File("d:/testdoc/index");
        File dataDir = new File("d:/testdoc/msword");
        long start = new Date().getTime();
        int numIndexed = index(indexDir, dataDir);
        long end = new Date().getTime();
        System.out.println("Indexing " + numIndexed + " files took "
            + (end - start) + " milliseconds");
    }

    public static int index(File indexDir, File dataDir) throws Exception {
        if (!dataDir.exists() || !dataDir.isDirectory()) {
            throw new IOException(dataDir
                + " does not exist or is not a directory");
        }
        // CJKAnalyzer (not StandardAnalyzer) so Chinese text is bigram-tokenized.
        IndexWriter writer = new IndexWriter(indexDir, new CJKAnalyzer(), true);
        writer.setUseCompoundFile(false);
        indexDirectory(writer, dataDir);
        int numIndexed = writer.docCount();
        writer.optimize();
        writer.close();
        return numIndexed;
    }

    private static void indexDirectory(IndexWriter writer, File dir)
            throws Exception {
        File[] files = dir.listFiles();
        for (int i = 0; i < files.length; i++) {
            File f = files[i];
            if (f.isDirectory()) {
                indexDirectory(writer, f); // recurse into subdirectories
            } else if (f.getName().endsWith(".doc")) {
                indexFile(writer, f);
            }
        }
    }

    private static void indexFile(IndexWriter writer, File f) throws Exception {
        if (f.isHidden() || !f.exists() || !f.canRead()) {
            return;
        }
        System.out.println("Indexing " + f.getCanonicalPath());
        Document doc = new Document();
        POIWordDocHandler handler = new POIWordDocHandler();
        doc.add(Field.UnStored("body",
            handler.getDocument(new FileInputStream(f))));
        doc.add(Field.Keyword("filename", f.getCanonicalPath()));
        writer.addDocument(doc);
    }
}
Note: indexFile uses Field.UnStored for the body, which tokenizes and indexes the text for full-text search but does not store it in the index; only the filename field (a stored Keyword) can be read back from a search hit.
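The directory walk in indexDirectory is a standard recursive scan. Here is a self-contained sketch of the same pattern that uses a temporary directory (instead of the post's d:/testdoc/msword) and simply collects the .doc file names it finds:

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class DocScanDemo {
    // Recurse into subdirectories; collect names of files ending in ".doc".
    static void scan(File dir, List<String> found) {
        File[] files = dir.listFiles();
        if (files == null) return; // not a directory or I/O error
        for (File f : files) {
            if (f.isDirectory()) {
                scan(f, found);
            } else if (f.getName().endsWith(".doc")) {
                found.add(f.getName());
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny tree: root/a.doc, root/sub/b.doc, root/sub/c.txt
        File root = new File(System.getProperty("java.io.tmpdir"), "docscan-demo");
        File sub = new File(root, "sub");
        sub.mkdirs();
        new File(root, "a.doc").createNewFile();
        new File(sub, "b.doc").createNewFile();
        new File(sub, "c.txt").createNewFile();

        List<String> found = new ArrayList<>();
        scan(root, found);
        Collections.sort(found); // listFiles() order is platform-dependent
        System.out.println(found); // [a.doc, b.doc]
    }
}
```

Note the null check on `listFiles()`: the original indexDirectory would throw a NullPointerException if the path were unreadable, since `listFiles()` returns null rather than an empty array in that case.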
The search class:
package org.tatan.framework;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class Searcher {

    public static void main(String[] args) throws Exception {
        Directory fsDir = FSDirectory.getDirectory("D:\\testdoc\\index", false);
        IndexSearcher is = new IndexSearcher(fsDir);
        // AnalyzerUtils is a helper from the "Lucene in Action" sample code;
        // it runs the analyzer over the text and returns the resulting tokens.
        Token[] tokens =
            AnalyzerUtils.tokensFromAnalysis(new CJKAnalyzer(), "一人一情");
        for (int i = 0; i < tokens.length; i++) {
            Query query =
                QueryParser.parse(tokens[i].termText(), "body", new CJKAnalyzer());
            Hits hits = is.search(query);
            for (int j = 0; j < hits.length(); j++) {
                Document doc = hits.doc(j);
                System.out.println(doc.get("filename"));
            }
        }
    }
}
Note: do not run a TermQuery against raw Chinese input; it will not match, because the index holds the analyzer's tokens rather than the original phrase, and there is no real Chinese word segmentation here yet. The query text must go through the same CJKAnalyzer that was used at indexing time, as the code above does.
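CJKAnalyzer indexes Chinese text as overlapping two-character tokens (bigrams), which is why a whole-phrase TermQuery finds nothing: no single term "一人一情" ever entered the index. A rough, Lucene-free sketch of that bigram scheme (it ignores CJKAnalyzer's handling of Latin text, offsets, and stop words):

```java
import java.util.ArrayList;
import java.util.List;

public class BigramDemo {
    // Sketch of CJK bigram tokenization: each adjacent pair of characters
    // becomes one token, so a 4-character phrase yields 3 overlapping terms.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("一人一情")); // [一人, 人一, 一情]
    }
}
```

Searching therefore means producing the same bigrams from the query string and matching them against the indexed terms, which is exactly what feeding each analyzed token back through QueryParser accomplishes in the Searcher class.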
Re: Solving full-text retrieval of Word documents with Lucene
dcl_dcl (guest) commented on 2007/1/24 9:35:45:
Hello, I built on your example, and when indexing a Word document I found that only a portion from the beginning of the file gets indexed, so the full text cannot be searched. This probably isn't a Lucene problem, is it? Does POI need some setting somewhere? I'm a beginner, please help me with this, thank you. Or where can I find more detailed material on this? My email is dongchangliang@sina.com