Blog info: Xing Hongrui's blog | Posts: 523 | Comments: 1142 | Guestbook messages: 0 | Visits: 9702041 | Created: December 20, 2004

[Java] Solving full-text retrieval of Word documents with Lucene (my calling is search) | Original, Software Technology
邢红瑞 (Xing Hongrui), posted 2005/11/20 13:08:37
Lucene is a person's name: "Lucene is Doug's wife's middle name; it's also her maternal grandmother's first name."
I read a post on Che Dong's blog about parsers for MS Word documents. Unlike RTF, which is ASCII-based, the binary Word format is normally parsed through the COM object mechanism. In fact, Apache POI is perfectly capable of parsing MS Word documents. I modified someone else's example; treat it as a starting point, and please don't throw bricks at me.
Lucene does not prescribe any data-source format; it only provides a generic structure (the Document object) as the input to indexing, and that input seems to be limited to plain text.
package org.tatan.framework;

import java.io.PrintStream;
import java.io.PrintWriter;

public class DocumentHandlerException extends Exception {

    private Throwable cause;

    /** Default constructor. */
    public DocumentHandlerException() {
        super();
    }

    /** Constructs with message. */
    public DocumentHandlerException(String message) {
        super(message);
    }

    /** Constructs with chained exception. */
    public DocumentHandlerException(Throwable cause) {
        super(cause.toString());
        this.cause = cause;
    }

    /** Constructs with message and chained exception. */
    public DocumentHandlerException(String message, Throwable cause) {
        super(message);
        this.cause = cause;   // keep the cause so getException() can return it
    }

    /** Retrieves the nested exception. */
    public Throwable getException() {
        return cause;
    }

    public void printStackTrace() {
        printStackTrace(System.err);
    }

    public void printStackTrace(PrintStream ps) {
        synchronized (ps) {
            super.printStackTrace(ps);
            if (cause != null) {
                ps.println("--- Nested Exception ---");
                cause.printStackTrace(ps);
            }
        }
    }

    public void printStackTrace(PrintWriter pw) {
        synchronized (pw) {
            super.printStackTrace(pw);
            if (cause != null) {
                pw.println("--- Nested Exception ---");
                cause.printStackTrace(pw);
            }
        }
    }
}
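The nested-cause pattern above (store the cause, append it to the stack trace, expose it via a getter) can be exercised with a small, self-contained sketch. The class and message names here are illustrative, not from the post:

```java
import java.io.PrintWriter;
import java.io.StringWriter;

public class NestedExceptionDemo {
    // Minimal copy of the pattern: keep the cause and print it after our own trace.
    static class HandlerException extends Exception {
        private final Throwable cause;
        HandlerException(String message, Throwable cause) {
            super(message);
            this.cause = cause;
        }
        Throwable getException() { return cause; }
        @Override
        public void printStackTrace(PrintWriter pw) {
            synchronized (pw) {
                super.printStackTrace(pw);
                if (cause != null) {
                    pw.println("--- Nested Exception ---");
                    cause.printStackTrace(pw);
                }
            }
        }
    }

    public static void main(String[] args) {
        try {
            try {
                throw new IllegalStateException("broken .doc stream");
            } catch (Exception e) {
                // Wrap the low-level failure, preserving it as the nested cause.
                throw new HandlerException("Cannot extract text", e);
            }
        } catch (HandlerException he) {
            StringWriter sw = new StringWriter();
            he.printStackTrace(new PrintWriter(sw, true));
            System.out.println(sw.toString().contains("--- Nested Exception ---")); // true
            System.out.println(he.getException().getMessage()); // broken .doc stream
        }
    }
}
```

On JDK 1.4 and later, `Throwable` already supports cause chaining natively, so this hand-rolled field mainly matters for code that must run on older JVMs, as was common when the post was written.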
The class that parses MS Word documents:
package org.tatan.framework;

import org.apache.poi.hdf.extractor.WordDocument;

import java.io.InputStream;
import java.io.PrintWriter;
import java.io.StringWriter;

public class POIWordDocHandler {

    public String getDocument(InputStream is) throws DocumentHandlerException {
        String bodyText = null;
        try {
            WordDocument wd = new WordDocument(is);
            StringWriter docTextWriter = new StringWriter();
            wd.writeAllText(new PrintWriter(docTextWriter));
            docTextWriter.close();
            bodyText = docTextWriter.toString();
        }
        catch (Exception e) {
            throw new DocumentHandlerException(
                "Cannot extract text from a Word document", e);
        }
        if ((bodyText != null) && (bodyText.trim().length() > 0)) {
            return bodyText;
        }
        return null;
    }
}
The indexing class:
package org.tatan.framework;

import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Date;

public class Indexer {

    public static void main(String[] args) throws Exception {
        File indexDir = new File("d:/testdoc/index");
        File dataDir = new File("d:/testdoc/msword");
        long start = new Date().getTime();
        int numIndexed = index(indexDir, dataDir);
        long end = new Date().getTime();
        System.out.println("Indexing " + numIndexed + " files took "
            + (end - start) + " milliseconds");
    }

    public static int index(File indexDir, File dataDir) throws Exception {
        if (!dataDir.exists() || !dataDir.isDirectory()) {
            throw new IOException(dataDir
                + " does not exist or is not a directory");
        }
        // CJKAnalyzer (not StandardAnalyzer) so Chinese text is bigram-tokenized.
        IndexWriter writer = new IndexWriter(indexDir, new CJKAnalyzer(), true);
        writer.setUseCompoundFile(false);
        indexDirectory(writer, dataDir);
        int numIndexed = writer.docCount();
        writer.optimize();
        writer.close();
        return numIndexed;
    }

    private static void indexDirectory(IndexWriter writer, File dir)
            throws Exception {
        File[] files = dir.listFiles();
        for (int i = 0; i < files.length; i++) {
            File f = files[i];
            if (f.isDirectory()) {
                indexDirectory(writer, f); // recurse into subdirectories
            } else if (f.getName().endsWith(".doc")) {
                indexFile(writer, f);
            }
        }
    }

    private static void indexFile(IndexWriter writer, File f) throws Exception {
        if (f.isHidden() || !f.exists() || !f.canRead()) {
            return;
        }
        System.out.println("Indexing " + f.getCanonicalPath());
        Document doc = new Document();
        POIWordDocHandler handler = new POIWordDocHandler();
        doc.add(Field.UnStored("body",
            handler.getDocument(new FileInputStream(f))));
        doc.add(Field.Keyword("filename", f.getCanonicalPath()));
        writer.addDocument(doc);
    }
}
Note: indexFile uses Field.UnStored for the body, which tokenizes and indexes the text for full-text search but does not store it in the index; only the filename field (a stored Keyword) can be read back from a search hit.
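The directory walk in indexDirectory is a standard recursive scan. Here is a self-contained sketch of the same pattern that uses a temporary directory (instead of the post's d:/testdoc/msword) and simply collects the .doc file names it finds:

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class DocScanDemo {
    // Recurse into subdirectories; collect names of files ending in ".doc".
    static void scan(File dir, List<String> found) {
        File[] files = dir.listFiles();
        if (files == null) return; // not a directory or I/O error
        for (File f : files) {
            if (f.isDirectory()) {
                scan(f, found);
            } else if (f.getName().endsWith(".doc")) {
                found.add(f.getName());
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny tree: root/a.doc, root/sub/b.doc, root/sub/c.txt
        File root = new File(System.getProperty("java.io.tmpdir"), "docscan-demo");
        File sub = new File(root, "sub");
        sub.mkdirs();
        new File(root, "a.doc").createNewFile();
        new File(sub, "b.doc").createNewFile();
        new File(sub, "c.txt").createNewFile();

        List<String> found = new ArrayList<>();
        scan(root, found);
        Collections.sort(found); // listFiles() order is platform-dependent
        System.out.println(found); // [a.doc, b.doc]
    }
}
```

Note the null check on `listFiles()`: the original indexDirectory would throw a NullPointerException if the path were unreadable, since `listFiles()` returns null rather than an empty array in that case.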
The search class:
package org.tatan.framework;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class Searcher {

    public static void main(String[] args) throws Exception {
        Directory fsDir = FSDirectory.getDirectory("D:\\testdoc\\index", false);
        IndexSearcher is = new IndexSearcher(fsDir);
        // AnalyzerUtils is a helper from the "Lucene in Action" sample code;
        // it runs the analyzer over the text and returns the resulting tokens.
        Token[] tokens =
            AnalyzerUtils.tokensFromAnalysis(new CJKAnalyzer(), "一人一情");
        for (int i = 0; i < tokens.length; i++) {
            Query query =
                QueryParser.parse(tokens[i].termText(), "body", new CJKAnalyzer());
            Hits hits = is.search(query);
            for (int j = 0; j < hits.length(); j++) {
                Document doc = hits.doc(j);
                System.out.println(doc.get("filename"));
            }
        }
    }
}
Note: do not run a TermQuery against raw Chinese input; it will not match, because the index holds the analyzer's tokens rather than the original phrase, and there is no real Chinese word segmentation here yet. The query text must go through the same CJKAnalyzer that was used at indexing time, as the code above does.
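CJKAnalyzer indexes Chinese text as overlapping two-character tokens (bigrams), which is why a whole-phrase TermQuery finds nothing: no single term "一人一情" ever entered the index. A rough, Lucene-free sketch of that bigram scheme (it ignores CJKAnalyzer's handling of Latin text, offsets, and stop words):

```java
import java.util.ArrayList;
import java.util.List;

public class BigramDemo {
    // Sketch of CJK bigram tokenization: each adjacent pair of characters
    // becomes one token, so a 4-character phrase yields 3 overlapping terms.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("一人一情")); // [一人, 人一, 一情]
    }
}
```

Searching therefore means producing the same bigrams from the query string and matching them against the indexed terms, which is exactly what feeding each analyzed token back through QueryParser accomplishes in the Searcher class.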
Re: Solving full-text retrieval of Word documents with Lucene
dcl_dcl (guest) commented on 2007/1/24 9:35:45:
Hello, I built on your example, and when indexing a Word document I found that only a portion from the beginning of the file gets indexed, so the full text cannot be searched. This probably isn't a Lucene problem, is it? Does POI need some setting somewhere? I'm a beginner, please help me with this, thank you. Or where can I find more detailed material on this? My email is dongchangliang@sina.com