CS 200: P5: Hashing
Purpose
The object of this assignment is to gain experience in hashing.
ICD-10
Medical people use ICD-10
codes to document illnesses and injuries. For example:
- E7212: Methylenetetrahydrofolate reductase deficiency
- F1210: Cannabis abuse, uncomplicated
- J200: Acute bronchitis due to Mycoplasma pneumoniae
- S63014S: Dislocation of distal radioulnar joint of right wrist, sequela
- W300XXA: Contact with combine harvester, initial encounter
- Y93J1: Activity, piano playing
File Format
A sample file is available:
icd10cm_order_2016.txt
Lines in the file are of variable length, but never longer than 400 characters.
Fields are defined as follows, where the first column is column 1:
Column | Length | Contents
|
---|
1 | 5 | Order number
|
7 | 7 | Code
|
15 | 1 | 0/1 for non-header/header
|
17 | 60 | Short description
|
78 | remaining | Long description
|
You may assume that lines in the data file are in that format.
Arguments
This program will take the following arguments:
- ICD-10 data file
- hash table size
- any number (at least one) of code queries
Each query corresponds to a “Code” from the above table.
For each query, the program should print the corresponding
long description, or complain if the code was not found.
Beware of confusion between “Code”, in terms of a field in the
ICD-10 data file, and hash codes.
Implementation
You may not simply read the file multiple times,
looking for the desired queries. Instead, you must, intitially,
read the entire file into a hash table.
All queries must use that hash table, not the file.
Hash Table Format
- The hash table length is the second program argument.
- Hash the query string to get the index into the hash table.
- Each entry in the hash table is a BST.
- Each BST node contains a hash code and a long description.
- The BST is sorted by the hash code, not
the query string.
Hash Code
To hash a query string, use this CRC32 algorithm:
import java.util.zip.CRC32;
import java.util.zip.Checksum;
// Compute CRC-32 checksum
public static long cksum(String s) {
Checksum engine = new CRC32();
byte[] bytes = s.getBytes();
engine.update(bytes, 0, bytes.length);
return engine.getValue();
}
This will give you a long
value 0‥4294967295. You will
have to process this further to get an int
suitable
for indexing into your hash table.
Sample Run
Here is a sample run, where “%
” is my shell prompt:
% javac icd.java
% java icd icd10cm_order_2016.txt 10000 B2681 V00112D Zulu W6161
B2681: Mumps hepatitis
V00112D: In-line roller-skater colliding with stationary object, subsequent encounter
Zulu: code not found
W6161: Bitten by duck
Requirements
- Your main program must be in
icd.java
.
- You may add other
*.java
files, as needed.
- Emit an error message to standard error for bad arguments,
unreadable file, etc.
- We will test your program with files with the same format
as the sample data file, but they may have different names,
or contain a different number of lines.
- You may, if you wish, use
BST.jar
from
recitation, as posted to Piazza.
- You may not use the built-in Java hash table,
or any built-in Java tree. You must build your own.
Submitting your code
Use the Checkin webserver to exercise and submit your code.
Submit a P5.jar
file containing your *.java
files.
Do not put .class
files in P5.jar
.
Do not put .class
files in P5.jar
.
Do not put .class
files in P5.jar
.
Do not put .class
files in P5.jar
.
Do not put .class
files in P5.jar
.