Published: 2021-07-04 04:11:23
1. You will write a Noun Group tagger, using data similar to what you used for HW 4. However, for this
program we will focus more on feature selection than on an algorithm.
1. Download the WSJ_CHUNKFILES.zip from NYUClasses (Resources). This includes the following
data files
1. WSJ_02-21.pos-chunk -- the training file
2. WSJ_24.pos -- the development file that you will test your system on
3. WSJ_24.pos-chunk -- the answer key to test your system output against
4. WSJ_23.pos -- the test file, to run your final system on, producing system output
2. Download MAX_ENT_files.zip, also from NYUClasses resources. This includes the following
program files:
1. maxent-3.0.0.jar, MEtag.java, MEtrain.java and trove.jar -- Java files for running the maxent
training and classification programs
2. score.chunk.py -- A python scoring script
3. Create a program that takes a file like WSJ_02-21.pos-chunk as input and produces a file consisting
of feature value pairs for use with the maxent trainer and classifier. As this step represents the bulk
of the assignment, there will be more details below, including the format information, etc. This
program should create two output files. From the training corpus (WSJ_02-21.pos-chunk), create a
training feature file (training.feature). From the development corpus (WSJ_24.pos), create a test
feature file (test.feature). See details below.
4. Compile and run MEtrain.java, giving it the feature-enhanced training file as input; it will produce
a MaxEnt model. MEtrain and MEtag use the maxent and trove packages, so you must include the
corresponding jar files, maxent-3.0.0.jar and trove.jar, on the classpath when you compile and
run. Assuming all java files are in the same directory, the following command-line commands will
compile and run these programs -- these commands are slightly different for POSIX systems (Linux or
Apple) than for Microsoft Windows.
1. For Linux, Apple and other Posix systems, do:
1. javac -cp maxent-3.0.0.jar:trove.jar *.java ### compiling
2. java -cp .:maxent-3.0.0.jar:trove.jar MEtrain training.feature model.chunk ### creating
the model of the training data
3. java -cp .:maxent-3.0.0.jar:trove.jar MEtag test.feature model.chunk response.chunk
### creating the system output
2. For Windows Only -- Use semicolons instead of colons in each of the above commands, i.e.,
the commands for Windows would be:
1. javac -cp maxent-3.0.0.jar;trove.jar *.java ### compiling
2. java -cp .;maxent-3.0.0.jar;trove.jar MEtrain training.feature model.chunk ### creating
the model of the training data
3. java -cp .;maxent-3.0.0.jar;trove.jar MEtag test.feature model.chunk response.chunk
### creating the system output
3. Quick Fixes
If the system is running out of memory, you can specify how much RAM java uses.
For example, java -Xmx16g -cp ... lets java use up to 16 gigabytes of RAM.
If your system cannot find java files or packages and just doesn't run for that reason,
the easiest fix is to run (the java steps) on one of NYU's linux servers. Accounts can be
made available to all students in this class. Alternatively, you can make sure that all path
variables are set properly, that java is properly installed, etc.
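The colon versus semicolon difference above can also be handled programmatically. The following is a minimal Python sketch, not part of the assignment files: it builds the three commands using os.pathsep, which is ':' on POSIX systems and ';' on Windows. The helper names classpath and build_commands are hypothetical, and the sketch lists MEtrain.java and MEtag.java explicitly instead of *.java, assuming those are the only Java sources in the directory.

```python
import os

# Jar files named in the assignment; both must be on the classpath.
JARS = ["maxent-3.0.0.jar", "trove.jar"]


def classpath(entries, sep=os.pathsep):
    """Join classpath entries with the platform separator (':' or ';')."""
    return sep.join(entries)


def build_commands(sep=os.pathsep, heap=None):
    """Return the compile, train, and tag commands as argument lists.

    heap is an optional value for -Xmx (for example "16g").
    """
    compile_cp = classpath(JARS, sep)
    run_cp = classpath(["."] + JARS, sep)  # "." so java finds the compiled classes
    java = ["java"] + (["-Xmx" + heap] if heap else [])
    return [
        ["javac", "-cp", compile_cp, "MEtrain.java", "MEtag.java"],
        java + ["-cp", run_cp, "MEtrain", "training.feature", "model.chunk"],
        java + ["-cp", run_cp, "MEtag", "test.feature", "model.chunk", "response.chunk"],
    ]


if __name__ == "__main__":
    # Print the commands for the current platform; run them by hand or
    # pass each list to subprocess.run.
    for cmd in build_commands():
        print(" ".join(cmd))
```

Printing the commands first makes it easy to confirm the classpath looks right before running anything.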
1. There should be 1 corresponding line of features for each line in the input file (training or test).
If the input and feature files have different numbers of lines, you have a bug.
2. Blank lines in the input file should correspond to blank lines in your feature file
3. Each line corresponding to text should contain tab separated values as follows:
1. the first field should be the token (word, punctuation, etc.)
2. this should be followed by as many features as you want (but no feature should contain
white space). Typically, features are recommended to have the form attribute=value,
e.g., POS=NN
This makes the features easy for humans to understand, but is not actually
required by the program, e.g., the code does not look for the = sign.
3. for the training file only, the last field should be the BIO tag (B-NP, I-NP or O)
4. for the test file, there should be no final BIO field (as there is none in the .pos file that
you would be generating it from)
5. A sample training file line (where \t represents tab):
fish\tPOS=NN\tprevious_POS=DT\tprevious_word=the\tI-NP ## actual lines
will probably be longer
4. There is a special symbol '@@' that you can use to refer to the previous BIO tag, e.g.,
Previous_BIO=@@
This allows you to simulate an MEMM because you can refer to the previous BIO tag
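The format rules above lend themselves to an automatic check before training. The following is a rough Python sketch of such a check, not something supplied with the assignment. The function name check_feature_lines is hypothetical, and the sketch assumes the input files are tab separated with the token in the first column.

```python
BIO_TAGS = {"B-NP", "I-NP", "O"}


def check_feature_lines(input_lines, feature_lines, training):
    """Return a list of problems; an empty list means the files line up."""
    problems = []
    if len(input_lines) != len(feature_lines):
        problems.append(
            "line counts differ: %d vs %d" % (len(input_lines), len(feature_lines))
        )
    for i, (src, feat) in enumerate(zip(input_lines, feature_lines), start=1):
        src, feat = src.rstrip("\n"), feat.rstrip("\n")
        if not src.strip():
            # Blank input lines must map to blank feature lines.
            if feat.strip():
                problems.append("line %d: expected a blank line" % i)
            continue
        fields = feat.split("\t")
        if fields[0] != src.split("\t")[0]:
            problems.append("line %d: first field is not the token" % i)
        if any(f == "" or " " in f for f in fields):
            problems.append("line %d: empty field or white space in a field" % i)
        if training and fields[-1] not in BIO_TAGS:
            problems.append("line %d: last field is not a BIO tag" % i)
        if not training and len(fields) > 1 and fields[-1] in BIO_TAGS:
            problems.append("line %d: test line ends in a BIO tag" % i)
    return problems
```

To use it, read both files with readlines() and pass training=True for the training pair or training=False for the test pair. Any returned message points at a line worth inspecting.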
2. Suggested features:
1. Features of the word itself: POS, the word itself, stemmed version of the word
2. Similar features of previous and/or following words (suggestion: use the features of previous
word, 2 words back, following word, 2 words forward)
3. Beginning/Ending Sentence (at the beginning of the sentence, omit features of 1 and 2 words
back; at end of sentence, omit features of 1 and 2 words forward)
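As one possible starting point, the suggested features can be sketched in Python as follows. This is an illustration under stated assumptions, not the required solution: it assumes each non-blank input line is tab separated as token, POS and (for training) BIO tag. The feature names (word, previous_POS, next2_word, sentence_start and so on) and the function names are made up for this example. Stemming is left out, and Previous_BIO=@@ is added to every token.

```python
def token_features(sentence, i):
    """Features for token i of a sentence given as (word, POS) pairs."""
    word, pos = sentence[i]
    feats = ["word=" + word, "POS=" + pos]
    context = ((-1, "previous"), (-2, "previous2"), (1, "next"), (2, "next2"))
    for offset, name in context:
        j = i + offset
        if 0 <= j < len(sentence):  # omit context that crosses a sentence boundary
            feats.append("%s_word=%s" % (name, sentence[j][0]))
            feats.append("%s_POS=%s" % (name, sentence[j][1]))
    if i == 0:
        feats.append("sentence_start=true")
    if i == len(sentence) - 1:
        feats.append("sentence_end=true")
    feats.append("Previous_BIO=@@")  # special symbol for the previous BIO tag
    return feats


def feature_lines(lines, training):
    """Map input lines to feature lines, one output line per input line."""
    out, sentence, tags = [], [], []

    def flush():
        # Emit the buffered sentence, then reset the buffers.
        for i, (word, _pos) in enumerate(sentence):
            fields = [word] + token_features(sentence, i)
            if training:
                fields.append(tags[i])  # BIO tag is the last field, training only
            out.append("\t".join(fields))
        sentence.clear()
        tags.clear()

    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:  # blank line marks the end of a sentence
            flush()
            out.append("")
        else:
            sentence.append((fields[0], fields[1]))
            if training:
                tags.append(fields[2])
    flush()
    return out


def write_feature_file(in_path, out_path, training):
    """Read a .pos or .pos-chunk file and write the matching feature file."""
    with open(in_path) as src:
        rows = feature_lines(src.readlines(), training)
    with open(out_path, "w") as dst:
        dst.write("\n".join(rows) + "\n")
```

For example, write_feature_file("WSJ_02-21.pos-chunk", "training.feature", True) and write_feature_file("WSJ_24.pos", "test.feature", False) would produce the two files described in the steps above. Buffering one sentence at a time keeps the boundary handling simple, since the previous and following tokens are looked up only within the current sentence.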