GATE – General Architecture for Text Engineering

Introduction

GATE is open source software written in Java, used mainly to extract information from unstructured data. Its components support the following capabilities:

  • Natural language processing (NLP).
  • Information extraction in many languages.

GATE components:

GATE works mainly on textual resources such as Word documents, plain text, HTML, PDF and XML, and it ships with a range of built-in plugins. The easiest way to start is the GATE Developer GUI. GATE has the following components:

  • Language Resources (LRs) – corpora (sets of documents), documents and annotations.
  • Processing Resources (PRs) – the components that process text, such as those in the ANNIE plugin.
  • Application – a combination of language resources and processing resources: a sequence of processing resources is applied to the language resources. (GATE’s Visual Resources, or VRs, are a separate component type: the GUI components used to view and edit resources.)
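
To make the relationship between these components concrete, here is a minimal toy model in plain Java. It is only a sketch: the types Span, Doc, ProcessingResource and Pipeline are hypothetical names invented for illustration and are not GATE classes. It shows the idea of an application as an ordered list of processing resources run over each document of a corpus.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model only: every type below is hypothetical and is NOT part of GATE's API.

/** A typed span of text, e.g. "Digits" covering characters 0..1. */
class Span {
    final String type;
    final int start;
    final int end;

    Span(String type, int start, int end) {
        this.type = type;
        this.start = start;
        this.end = end;
    }

    @Override
    public String toString() {
        return type + "[" + start + "," + end + ")";
    }
}

/** Language resource: a document holding text plus the annotations added so far. */
class Doc {
    final String text;
    final List<Span> annotations = new ArrayList<>();

    Doc(String text) {
        this.text = text;
    }
}

/** Processing resource: reads a document and adds annotations to it. */
interface ProcessingResource {
    void execute(Doc doc);
}

/** Application: runs each processing resource, in order, over every document of a corpus. */
class Pipeline {
    private final List<ProcessingResource> steps = new ArrayList<>();

    Pipeline add(ProcessingResource pr) {
        steps.add(pr);
        return this;
    }

    void run(List<Doc> corpus) {
        for (Doc doc : corpus) {
            for (ProcessingResource pr : steps) {
                pr.execute(doc);
            }
        }
    }
}

class PipelineDemo {
    /** A trivial processing resource that marks every run of digits as a "Digits" span. */
    static final ProcessingResource DIGIT_FINDER = doc -> {
        Matcher m = Pattern.compile("\\d+").matcher(doc.text);
        while (m.find()) {
            doc.annotations.add(new Span("Digits", m.start(), m.end()));
        }
    };

    public static void main(String[] args) {
        List<Doc> corpus = Arrays.asList(new Doc("3 millions and 400 billions"));
        new Pipeline().add(DIGIT_FINDER).run(corpus);
        System.out.println(corpus.get(0).annotations);
    }
}
```

In GATE itself the same roles are played by real corpus, document and controller classes; the toy version only mirrors the order of operations.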

Use case

Let’s walk through a use case: how to use the GATE Developer GUI, and how to create a sample GATE application that uses GATE plugins to extract meaningful information from unstructured content.

Prerequisites:

  • Install JDK 1.6+
  • Install GATE software (GATE 7.1) – http://gate.ac.uk/download/

Solution:

This use case relies on the following plugins to demonstrate GATE’s functionality.

ANNIE plugin: this plugin provides the following functionality.

  • Default tokenizer – splits the text into tokens.
  • Sentence splitter – splits the text into sentences based on punctuation.
  • Gazetteer – looks up words and phrases in predefined lists.
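
For readers new to these terms, the sketch below shows what tokenizing and sentence splitting produce, using only the JDK’s java.text.BreakIterator. This is an analogy, not ANNIE’s implementation; the class name SplitSketch and its methods are made up for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: uses the JDK's BreakIterator, not GATE's tokenizer or sentence splitter.
class SplitSketch {

    /** Splits text into sentences using the JDK's sentence boundary rules. */
    static List<String> sentences(String text) {
        return pieces(text, BreakIterator.getSentenceInstance(Locale.ENGLISH));
    }

    /** Splits text into word and punctuation tokens, skipping pure whitespace. */
    static List<String> tokens(String text) {
        return pieces(text, BreakIterator.getWordInstance(Locale.ENGLISH));
    }

    /** Walks the boundaries reported by the iterator and collects the non-blank pieces. */
    private static List<String> pieces(String text, BreakIterator it) {
        List<String> out = new ArrayList<>();
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end).trim();
            if (!piece.isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "One hundred. Three thousand four hundred.";
        System.out.println(sentences(text));
        System.out.println(tokens(text));
    }
}
```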

Numbers Tagger plugin:

  • This plugin finds numbers written either in words or in digits and annotates them with their numeric values.
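
To illustrate what “annotating a number with its numeric value” means, here is a small, self-contained sketch that converts simple English number phrases to a value. It is a toy written for this article’s kind of input (such as “three thousand four hundred” or “3 millions”), not the plugin’s actual algorithm; the class NumberWords and its parse method are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Toy converter for simple English number phrases; NOT the Numbers Tagger's algorithm.
class NumberWords {
    private static final Map<String, Long> UNITS = new HashMap<>();
    private static final Map<String, Long> SCALES = new HashMap<>();

    static {
        String[] small = {"zero", "one", "two", "three", "four", "five", "six", "seven",
            "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
            "sixteen", "seventeen", "eighteen", "nineteen"};
        for (int i = 0; i < small.length; i++) {
            UNITS.put(small[i], (long) i);
        }
        String[] tens = {"twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"};
        for (int i = 0; i < tens.length; i++) {
            UNITS.put(tens[i], 20L + 10L * i);
        }
        SCALES.put("thousand", 1_000L);
        SCALES.put("million", 1_000_000L);
        SCALES.put("billion", 1_000_000_000L);
    }

    /** Returns the value of the phrase; throws IllegalArgumentException on an unknown word. */
    static long parse(String phrase) {
        long total = 0; // sum of completed thousand/million/billion groups
        long group = 0; // value of the group currently being read
        for (String raw : phrase.toLowerCase().trim().split("[\\s-]+")) {
            String word = raw.replaceAll("[.,]$", ""); // drop one trailing period or comma
            if (word.isEmpty() || word.equals("and")) {
                continue;
            }
            if (word.matches("\\d+")) {
                group += Long.parseLong(word);
            } else if (UNITS.containsKey(word)) {
                group += UNITS.get(word);
            } else if (word.equals("hundred")) {
                group = Math.max(group, 1) * 100;
            } else {
                // Accept plural scale words such as "millions".
                String singular = word.endsWith("s") ? word.substring(0, word.length() - 1) : word;
                Long scale = SCALES.get(singular);
                if (scale == null) {
                    throw new IllegalArgumentException("Unknown number word: " + raw);
                }
                total += Math.max(group, 1) * scale;
                group = 0;
            }
        }
        return total + group;
    }

    public static void main(String[] args) {
        String[] inputs = {"fourteen.", "one hundred.", "three thousand four hundred.", "3 millions."};
        for (String input : inputs) {
            System.out.println(input + " = " + parse(input));
        }
    }
}
```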

GATE Developer GUI:

Launch the GATE application from the installation directory.

STEPS:

  • Open the File menu and click “Manage plugins” to load the required plugins. Load the ANNIE plugin, the JAPE Plus Transducer plugin and the Tagger_Numbers plugin.

[Screenshot: picture1]

  • Select the language resources and create a GATE document.
  • Select the processing resources and load the ANNIE sentence splitter and the English tokenizer. Then load the Numbers Tagger processing resource.

[Screenshot: picture2]

  • Select the applications option and create a sample pipeline application. Add the processing resources to it, select the language resources, and run the application.

[Screenshot: picture3]

The annotation sets and the annotation list show the output of the plugins. The Numbers Tagger creates “Number” annotations, marking numbers written in words or in digits and recording their numeric values.

GATE Embedded Sample:

Create a sample Java project as shown in the screenshot below.

[Screenshot: package explorer]

GATE jars:
Copy gate.jar from the bin directory (${GATE_HOME}\bin) and the NumbersTagger jar from ${GATE_HOME}\plugins\Tagger_Numbers. The other supporting jars are in the lib folder of the GATE installation directory (${GATE_HOME}\lib).

  • Steps:

Initialize the GATE system. For this to work, GATE must know where it is installed: create a system environment variable named GATE_HOME that points to the GATE installation directory.
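
The check below is a minimal sketch of how an embedded application might locate and sanity-check the GATE home before initializing. It uses only the JDK, so it runs without GATE on the classpath. The class GateHomeResolver, its method names, and the rule that the gate.home system property takes precedence over the GATE_HOME environment variable are assumptions made for this example; the comments at the end show where the real gate.Gate calls would go.

```java
import java.io.File;

// Sketch only: GateHomeResolver is a hypothetical helper, not part of GATE.
class GateHomeResolver {

    /**
     * Picks the GATE installation directory. Assumption for this sketch: the
     * gate.home system property wins over the GATE_HOME environment variable.
     * Returns null if neither value is set.
     */
    static File resolve(String systemProperty, String environmentVariable) {
        String path = (systemProperty != null && !systemProperty.isEmpty())
            ? systemProperty
            : environmentVariable;
        return (path == null || path.isEmpty()) ? null : new File(path);
    }

    /** Describes what is wrong with a candidate home, or returns null if it looks usable. */
    static String problem(File home) {
        if (home == null) {
            return "GATE home is not set (no gate.home property or GATE_HOME variable)";
        }
        if (!home.isDirectory()) {
            return "GATE home is not a directory: " + home;
        }
        if (!new File(home, "plugins").isDirectory()) {
            return "No plugins directory under " + home;
        }
        return null;
    }

    public static void main(String[] args) {
        File home = resolve(System.getProperty("gate.home"), System.getenv("GATE_HOME"));
        String problem = problem(home);
        if (problem != null) {
            System.err.println(problem);
            return;
        }
        System.out.println("Using GATE home: " + home);
        // With gate.jar on the classpath, the embedded application would now call:
        //   gate.Gate.setGateHome(home);
        //   gate.Gate.init();
    }
}
```

Failing early with a readable message is a design choice here: it avoids a later failure deep inside initialization when the directory is missing or incomplete.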

 Challenges:

  • A GATE Embedded application expects the GATE home to be initialized. If it is not, GATE throws the following exception.

[Screenshot: Gate main]

  • If gate-compiler-jdt.jar is not included in the project, GATE throws the following exception.

[Screenshot: gate compiler]
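
One way to catch the second problem before running the application is to check the classpath for the jars named above. The following is a small sketch under that assumption; ClasspathCheck and its missing method are hypothetical names, and the check matches jar file names exactly, so renamed or versioned jars would be reported as missing.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Sketch only: ClasspathCheck is a hypothetical helper, not part of GATE.
class ClasspathCheck {

    /** Returns the required jar file names that do not appear on the given classpath. */
    static List<String> missing(String classpath, String... requiredJars) {
        String[] entries = classpath.split(Pattern.quote(File.pathSeparator));
        List<String> absent = new ArrayList<>();
        for (String required : requiredJars) {
            boolean found = false;
            for (String entry : entries) {
                // Compare only the file name, ignoring the directory part of the entry.
                if (new File(entry).getName().equals(required)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                absent.add(required);
            }
        }
        return absent;
    }

    public static void main(String[] args) {
        String classpath = System.getProperty("java.class.path", "");
        System.out.println("Missing jars: " + missing(classpath, "gate.jar", "gate-compiler-jdt.jar"));
    }
}
```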

Output:

[Screenshot: output final]

Conclusion

GATE supports a wide range of language-processing functionality. It is very useful for processing unstructured content and retrieving useful information from it. The application above is an example that shows GATE’s functionality through its built-in plugins. With GATE we can also create our own custom plugins to suit our own requirements.

  • Bhagesh Arora

    Hi
    I am trying the steps you mentioned above, but I am getting an exception. Please review it and suggest an appropriate solution.

    Initializing the GATE ………………..
    log4j:WARN No appenders could be found for logger (gate.Gate).
    log4j:WARN Please initialize the log4j system properly.
    log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
    gate.util.GateRuntimeException: Could not infer installed plug-ins home!
    Please set it manually using the -Dgate.plugins.home option in your start-up script.
    at gate.Gate.initLocalPaths(Gate.java:303)
    at gate.Gate.init(Gate.java:163)
    at com.example.gate.main.GateMain.initializeGate(GateMain.java:113)
    at com.example.gate.main.GateMain.main(GateMain.java:34)
    Reading the input files and adding the document to the corpus ……………
    Input……………….one.
    fourteen.
    one hundred.
    three thousand four hundred.
    3 millions.
    400 billions.

    controller…null
    java.lang.NullPointerException
    at com.example.gate.main.GateMain.process(GateMain.java:50)
    at com.example.gate.main.GateMain.main(GateMain.java:35)

    • http://www.treselle.com/ Treselle Systems Blog

      Can you initialize GATE_HOME properly and check again?