Lingucomponent Grammar Checking

OffiDocs provides OpenOffice online, so we are interested in the Lingucomponent Grammar Checking project, whose goal is to design, develop, and implement a grammar checker for English and other supported languages. A grammar checker API has been available in OpenOffice.org since version 3.0.

Grammar checking is seen as one particular implementation of a text iteration and markup process. Other iteration/markup processes, such as spell checking or smart tagging, can work in basically the same way (though they are currently not implemented like this). Wherever grammar checking is mentioned in the following documentation, it can be read as a placeholder for the more general task of text markup. Because the objects carrying out the text iteration know which markup process they are used for, the iteration can be fine-tuned to the needs of that process.

The grammar checking process consists of:

- one or more documents to be checked

- one or more grammar checker implementations, each supporting at least one language

- one or more grammar check dialogs (at most one instance per document)

- one context menu, shown when clicking on text marked as incorrect

- a global grammar checking iterator (common to all documents), implemented as a singleton, that checks one sentence (of an arbitrary document) at a time

- one thread object per grammar checker, used to perform the checking without blocking the GUI

- objects iterating through the text of a document, each object representing a single grammar checking task that was requested

- objects representing text blocks in a text document ("flat paragraphs") that abstract from the concrete structure of the document and provide access to the text through simple text strings and integer values describing the positions and lengths of substrings

 

Objects and their interfaces

Three parts work together. The first part comes from the document being checked and is an implementation specific to the particular type of document (e.g. Writer or Calc). It encapsulates access to the text of the document. A document that is to be checked for grammar errors must support the interface com.sun.star.text.XFlatParagraphIteratorProvider. Through this interface it must be able to provide objects implementing com.sun.star.text.XFlatParagraphIterator, which in turn return objects implementing com.sun.star.text.XFlatParagraph. The latter interface is derived from com.sun.star.text.XTextMarkup. In the following we call these objects "flat paragraph iterators" (FPIterator) and "flat paragraphs" (FP). Where the word "paragraph" is used, it also denotes an FP, not a real paragraph in the document, since the two are not always the same.

An FP is not necessarily a paragraph in the document's sense. It can be a collection of paragraphs (e.g. a list), and it contains not only the flow text but also other text content such as text frames, headers, and footers. Because only the document core can handle such FP objects efficiently, this is a document-specific implementation. The FP does not reveal the complete internal text structure or its attributes; its content is accessible only as a string containing the complete text block.
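The idea of a flat paragraph can be illustrated with a small sketch. This is not the UNO API: the class FlatParagraph, the helper flatten_list, and all method names are hypothetical and only model the description above, namely a text block exposed as a single string, with markup expressed as integer positions and lengths.

```python
from dataclasses import dataclass, field


@dataclass
class Markup:
    """One marked range inside a flat paragraph (hypothetical type)."""
    start: int
    length: int
    message: str


@dataclass
class FlatParagraph:
    """Hypothetical model of an FP: plain text plus markup ranges.

    The document structure behind the text is deliberately not exposed.
    """
    text: str
    markups: list = field(default_factory=list)

    def commit_markup(self, start, length, message):
        # Positions refer to the flat string, not to document objects.
        if start < 0 or length < 0 or start + length > len(self.text):
            raise ValueError("markup range outside paragraph text")
        self.markups.append(Markup(start, length, message))

    def marked_text(self, markup):
        return self.text[markup.start:markup.start + markup.length]


def flatten_list(items):
    """Assumption: a document core might expose a whole list as one FP."""
    return FlatParagraph("\n".join(items))


fp = flatten_list(["one item", "too item"])
fp.commit_markup(9, 3, "possible confusion: 'too' / 'two'")
print(fp.marked_text(fp.markups[0]))
```

A checker working on such an object only ever sees the string and reports offsets into it; mapping those offsets back to real document positions stays inside the document-specific implementation.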

An FPIterator is an object that iterates through all the FP objects that together make up the text content of the document. The order in which the paragraphs are visited is arbitrary and an implementation detail of the FPIterator. The "regular" text content should usually be provided in reading direction, but how other text fits in, such as headers and footers (which exist only once but are repeated on every page) or text frames (which may be embedded in the flow text), is not predetermined. An iteration is always tied to a text markup process that is meant to treat the whole document. The iteration therefore wraps around at the end of the document and does not end before all paragraphs have been marked as "checked" for the particular markup process (such as grammar checking). Paragraphs already marked as "checked" are skipped. For clients, using an FPIterator is therefore simple: ask it for new FP objects until none is returned, without caring how it is implemented.
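The wrap-around and skip behaviour described above can be sketched as follows. The classes Para and FlatParagraphIterator and the method next_para are hypothetical stand-ins, not the UNO interfaces, and the single boolean "checked" flag is a simplification of a per-markup-process flag.

```python
class Para:
    """Hypothetical flat paragraph with one 'checked' flag."""
    def __init__(self, text):
        self.text = text
        self.checked = False


class FlatParagraphIterator:
    """Hypothetical iterator: wraps around, skips checked paragraphs."""
    def __init__(self, paragraphs, start=0):
        self._paras = paragraphs
        self._pos = start  # e.g. the paragraph at the cursor position

    def next_para(self):
        n = len(self._paras)
        # Look at every paragraph at most once, wrapping at the end.
        for offset in range(n):
            idx = (self._pos + offset) % n
            para = self._paras[idx]
            if not para.checked:
                self._pos = (idx + 1) % n
                return para
        return None  # everything is marked as checked


def check_all(iterator, check):
    """Client side: ask for paragraphs until none is returned."""
    visited = []
    while (para := iterator.next_para()) is not None:
        check(para)
        para.checked = True
        visited.append(para.text)
    return visited


paras = [Para("a"), Para("b"), Para("c"), Para("d")]
paras[2].checked = True  # "c" was already checked earlier
it = FlatParagraphIterator(paras, start=1)
print(check_all(it, lambda p: None))
```

In this sketch the client marks each paragraph as checked itself; the point is only that the client loop stays the same regardless of the order the iterator chooses.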

The second part is a grammar checker: a component implementing the interface com.sun.star.linguistic2.XGrammarChecker. For each language there may be a particular component that is able to check for grammar errors in that language. The configuration determines which component is responsible for which language. The implementation of com.sun.star.linguistic2.XGrammarChecker representing a particular component encapsulates the "private" API of that grammar checking component. This private API can be UNO based or pure Java, a CLI or COM interface, a C API, and so on: anything that can be used or bridged to from inside an implementation of a UNO interface. As the interface is fairly small, wrapping existing grammar checkers for use in OpenOffice.org should not be very complicated.
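A minimal sketch of this wrapping idea, under stated assumptions: LegacyEngine stands in for some third-party checker with its own private API, EnglishCheckerWrapper plays the role of the component that adapts it, and the dictionary models the language-to-component configuration. None of these names or signatures come from the UNO API.

```python
from dataclasses import dataclass


class LegacyEngine:
    """Stand-in for an existing checker with a 'private' API (hypothetical)."""
    def find_problems(self, sentence):
        # Returns (offset, length, hint) tuples; toy rule for illustration.
        idx = sentence.find("a apple")
        return [(idx, 1, "use 'an'")] if idx != -1 else []


@dataclass
class GrammarError:
    start: int
    length: int
    comment: str


class EnglishCheckerWrapper:
    """Adapts the private API to one small, uniform interface."""
    locales = ("en-US", "en-GB")

    def __init__(self, engine):
        self._engine = engine

    def has_locale(self, locale):
        return locale in self.locales

    def check(self, text, locale):
        if not self.has_locale(locale):
            return []
        return [GrammarError(o, n, hint)
                for o, n, hint in self._engine.find_problems(text)]


def build_registry(checkers):
    """Models the configuration: which component handles which language."""
    return {loc: c for c in checkers for loc in c.locales}


registry = build_registry([EnglishCheckerWrapper(LegacyEngine())])
checker = registry.get("en-US")
print(checker.check("I ate a apple.", "en-US"))
```

The caller only depends on has_locale and check, so replacing the engine behind the wrapper does not affect the rest of the process.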

In the middle lies the third part, which mediates between the other two. It implements the "logic" of the grammar checking process. Because it talks to the other two parts only through their defined UNO APIs, this middle part is independent of the particular document type and grammar checking component. A UNO service called com.sun.star.linguistic2.GrammarCheckingIterator is the component that actually carries out the grammar checking process for all supported scenarios. It is a singleton that controls all running grammar checking processes and thus also knows all existing grammar checking components. It implements the interface com.sun.star.linguistic2.XGrammarCheckingIterator and also provides an object implementing com.sun.star.linguistic2.XGrammarCheckingResultListener. In the following this object is called the GCIterator.
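The mediator role can be sketched roughly as below. This is a simplified model, not the actual service: the class and method names are hypothetical, checkers and listeners are plain callables, and a single worker thread is used here for brevity, whereas the design described earlier uses one thread object per grammar checker.

```python
import queue
import threading


class GrammarCheckingIterator:
    """Hypothetical singleton that checks one sentence at a time."""
    _instance = None
    _lock = threading.Lock()

    @classmethod
    def instance(cls):
        with cls._lock:
            if cls._instance is None:
                cls._instance = cls()
            return cls._instance

    def __init__(self):
        self._checkers = {}    # locale -> callable(sentence) -> errors
        self._listeners = []   # callables receiving the results
        self._tasks = queue.Queue()
        # Background worker so that callers (the GUI) are not blocked.
        threading.Thread(target=self._run, daemon=True).start()

    def register_checker(self, locale, checker):
        self._checkers[locale] = checker

    def add_result_listener(self, listener):
        self._listeners.append(listener)

    def start_checking(self, doc_id, sentences, locale):
        for sentence in sentences:
            self._tasks.put((doc_id, sentence, locale))

    def wait_until_idle(self):
        self._tasks.join()

    def _run(self):
        while True:
            doc_id, sentence, locale = self._tasks.get()
            try:
                checker = self._checkers.get(locale)
                errors = checker(sentence) if checker else []
                for listener in list(self._listeners):
                    listener(doc_id, sentence, errors)
            finally:
                self._tasks.task_done()


gci = GrammarCheckingIterator.instance()
gci.register_checker("en-US", lambda s: ["double space"] if "  " in s else [])
gci.add_result_listener(lambda d, s, e: print(d, repr(s), e))
gci.start_checking("doc1", ["Fine.", "Two  spaces."], "en-US")
gci.wait_until_idle()
```

Documents and checkers never reference each other directly in this sketch; both are known only to the shared iterator, which is what makes the middle part independent of either side.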
