http://en.wikipedia.org/wiki/N-gram
The code below is a sample I put together after downloading Lucene 3.0.1, but let's look at the result first.
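Indexing "아버지가방에들어가신다" extracts every run of n adjacent characters (CJKAnalyzer uses n = 2, i.e. bigrams) and indexes those, which is why a search for "아버지" gets a hit. The downside, of course, is that a search for "가방" gets a hit too. (-_-;)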
Optimizing index...
188 total milliseconds
Term: content:가방
Term: content:가신
Term: content:들어
Term: content:방에
Term: content:버지
Term: content:신다
Term: content:아버
Term: content:어가
Term: content:에들
Term: content:지가
Term: seqid:2
Searching for: 가방
1 total matching documents
My seq ID: 2
It has been a few years since I last did this, so it took me about 20 minutes. (-_-;;)
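import java.io.File;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// The enclosing class was cut off in the original post; this name is a placeholder.
public class LuceneSample {

    public void testLucene() {
        try {
            File index = new File("index");
            Date start = new Date();

            // Create a fresh index (create = true) analyzed with CJKAnalyzer.
            IndexWriter writer = new IndexWriter(FSDirectory.open(index),
                    new CJKAnalyzer(Version.LUCENE_30), true,
                    new IndexWriter.MaxFieldLength(1000000));

            // One document: an untokenized ID plus the analyzed content.
            Document doc = new Document();
            doc.add(new Field("seqid", "2", Field.Store.YES,
                    Field.Index.NOT_ANALYZED));
            doc.add(new Field("content", "아버지가방에들어가신다", Field.Store.YES,
                    Field.Index.ANALYZED));
            writer.addDocument(doc);

            System.out.println("Optimizing index...");
            writer.optimize();
            writer.close();

            Date end = new Date();
            System.out.print(end.getTime() - start.getTime());
            System.out.println(" total milliseconds");

            // Dump every term in the index to see what the analyzer produced.
            IndexReader reader = IndexReader.open(
                    FSDirectory.open(new File("index")), true);
            TermEnum termEnum = reader.terms();
            while (termEnum.next()) {
                Term term = termEnum.term();
                System.out.println("Term: " + term);
            }
            termEnum.close();

            // Search the "content" field with the same analyzer used for indexing.
            Searcher searcher = new IndexSearcher(reader);
            Analyzer analyzer = new CJKAnalyzer(Version.LUCENE_30);
            QueryParser parser = new QueryParser(Version.LUCENE_30, "content",
                    analyzer);
            System.out.println("Searching for: 가방");
            Query query = parser.parse("가방");

            // First pass: count the matches.
            TopScoreDocCollector collector = TopScoreDocCollector.create(50, false);
            searcher.search(query, collector);
            int numTotalHits = collector.getTotalHits();
            System.out.println(numTotalHits + " total matching documents");

            // Second pass: collect all matches and print their stored IDs.
            collector = TopScoreDocCollector.create(numTotalHits, false);
            searcher.search(query, collector);
            ScoreDoc[] hits = collector.topDocs().scoreDocs;
            for (int i = 0; i < hits.length; i++) {
                Document hitDoc = searcher.doc(hits[i].doc);
                String seqId = hitDoc.get("seqid");
                System.out.println("My seq ID: " + seqId);
            }

            searcher.close();
            reader.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}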
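To see why both "아버지" and "가방" get hits, here is a minimal sketch of bigram extraction in plain Java, with no Lucene involved. The class `BigramSketch` and its methods `bigrams` and `matches` are names made up for this illustration. It only approximates what CJKAnalyzer does: the real analyzer has extra rules for non-CJK text, and a real Lucene query also looks at term positions, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of Lucene.
class BigramSketch {

    // Split text into overlapping two-character grams: "abc" -> "ab", "bc".
    static List<String> bigrams(String text) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            grams.add(text.substring(i, i + 2));
        }
        return grams;
    }

    // Simplified matching: the query hits if every one of its bigrams
    // is among the indexed terms (term positions are not checked here).
    static boolean matches(String indexedText, String query) {
        Set<String> terms = new TreeSet<String>(bigrams(indexedText));
        List<String> queryGrams = bigrams(query);
        return !queryGrams.isEmpty() && terms.containsAll(queryGrams);
    }

    public static void main(String[] args) {
        String content = "아버지가방에들어가신다";

        // Print the distinct bigrams, mimicking the term dump above.
        for (String term : new TreeSet<String>(bigrams(content))) {
            System.out.println("Term: content:" + term);
        }

        System.out.println("아버지 -> " + matches(content, "아버지"));
        System.out.println("가방 -> " + matches(content, "가방"));
        System.out.println("어머니 -> " + matches(content, "어머니"));
    }
}
```

The 11-character string yields 10 bigrams, the same ten `content` terms as in the dump above. "가방" is one of them even though it straddles a word boundary, which is exactly the false positive complained about earlier.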