1.  contents of the dictionary word file 
2.  status update
3.  ancient history
4.  stats




1.  CONTENTS OF THE DICTIONARY WORD FILE
---------------------------------------
/*
* Copyright 2009 by Joseph Speigle 
*
* These korean word lists are not to be redistributed for any reason, as part of a package
* or as a post on a blog or website about Korean
* these are intended only for non-profit, personal study
* 
* The above are the terms of use, which you agree to when you download the file.  
* 
* These files are not open source and all rights are reserved by me, to me.
*
* If you would like to use them for a special use, contact me at webmaster contact ฟฟฟ 
* Joseph Speigle
* 
* http://ezcorean.com
*
*/

--
-- PostgreSQL database dump
--

SET client_encoding = 'UTF8';
SET check_function_bodies = false;
SET client_min_messages = warning;

SET search_path = modpgwebuser, pg_catalog;

SET default_tablespace = '';

SET default_with_oids = false;

--
-- Name: korean_english; Type: TABLE; Schema: modpgwebuser; Owner: postgres; Tablespace: 
--

CREATE TABLE korean_english (
    wordid integer DEFAULT nextval('korean_english_wordid_seq'::regclass),
    word character varying(130),
    syn character varying(190),
    def text,
    posn integer,
    pos character varying(13),
    submitter character varying(25),
    doe timestamp without time zone,
    wordsize smallint,
    hanja character varying
);


ALTER TABLE modpgwebuser.korean_english OWNER TO postgres;

--
-- Data for Name: korean_english; Type: TABLE DATA; Schema: modpgwebuser; Owner: postgres
--

INSERT INTO korean_english VALUES (7981, '?˜๊ธ‹', '', 'asquint', 1, '1', 'engdic', '2006-01-16 00:52:46', 6, NULL);
INSERT INTO korean_english VALUES (10420, '?ค์†', '', 'bail', 1, '1', 'engdic', '2006-01-16 00:52:46', 6, NULL);
INSERT INTO korean_english VALUES (10679, '๋ฐ”ํ–ฅ?˜์?', '', 'balm', 1, '1', 'engdic', '2006-01-16 00:52:46', 12, NULL);
INSERT INTO korean_english VALUES (14383, '비트', '', 'bit', 1, '1', 'engdic', '2006-01-16 00:52:46', 6, NULL);
INSERT INTO korean_english VALUES (14692, '칼날', '', 'blade', 1, '1', 'engdic', '2006-01-16 00:52:46', 6, NULL);
INSERT INTO korean_english VALUES (17793, '?‘ํ…Œ', '', 'brim', 1, '1', 'engdic', '2006-01-16 00:52:46', 6, NULL);


2.  STATUS UPDATE
---------------------------------
It doesn't have all the advanced words yet, but search results have become
significantly simpler and more pertinent.

You will also need the 6000 most popular Korean words, which are in a separate
file.  The linking is done by putting 'see 6000' in the definition field (def)
of the korean_english table.
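
A rough sketch of how the 'see 6000' link could be followed once both files are
loaded.  The table name korean_6000 and its columns (word, def) are made up for
illustration, since the separate file is not shown here, and the query assumes
the marker is stored as exactly 'see 6000'.

```sql
-- Sketch only: korean_6000 and its columns (word, def) are hypothetical names
-- for wherever the separate 6000-word file was loaded.
SELECT ke.wordid,
       ke.word,
       k6.def AS def_from_6000
FROM korean_english AS ke
JOIN korean_6000 AS k6 ON k6.word = ke.word
WHERE ke.def = 'see 6000';
```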

There are other oddities.  The hanja, for example, are cross-linked to hanja
entries, because this dictionary is adapted for web use; see http://ezcorean.com

For detailed stats, see below.

3.  ANCIENT HISTORY
-----------------------------------------

1) reversing engdic with a perl script

engdic is a hard file to clean up.  

http://www.sirfsup.com/code/perl/bin_perl/engdic_perl/engdic.perl

Engdic has many religious terms, shipping terms, and old slang from about 40 or 50 years ago.
It also contains many typos.
Many of the Korean words are outdated.
Many of the English words are abbreviated to make the dictionary searchable, which
is bad for two reasons: (1) the Korean explanation/definition of the word is rather long, meaning there
isn't a 1:1 relationship here, and (2) the English word really doesn't contain all the info in the
Korean description, so it has to be fleshed out: when the Korean is turned into an example
(it's not a word for a dictionary), the English needs to be changed and made longer.

That project also led to the mysql-postgres migration script

http://pgfoundry.org/projects/mysql2pgsql/

2) I redefined the 6000 most popular words: I scrapped the voluminous engdic definitions
for them, since there were too many to handle, and then rebuilt the definitions from scratch.

Read the blurb at

http://ezcorean.com/korean_vocabulary/common_korean_vocabulary

3) deleted all entries with more than 6 (or some number) spaces in them

4) added hanja from the hanja list for the openoffice file on linux

http://www.sirfsup.com/code/php/bin_php/hanja_openoffice.phps

5) Most of the heavy lifting was done by the scripts of (1) and (4) and previous
work.  Now, most of the work must be done by slow, painful, word-by-word manual
inspection.  So I have slowly been cleaning up the dictionary entries one by one
using the HTML forms on http://ezcorean.com

(searching for words in the korean-english dictionary and cleaning up the results)

For example, I use the edit screen, independent of the search results screen, to
edit individual words, fix hanja, etc.

Now, also, ezcorean 'Kengdic admin mode' includes forms for mass editing, deleting, 
changing of definitions, merging, etc.  Because of its power, only I've used it up 
till now.   
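
Step 3 above (dropping entries with too many spaces) can be sketched in SQL
roughly as follows.  This is not the script that was actually run; the cutoff
of 6 is only the approximate number mentioned in that step.

```sql
-- Sketch only: count the spaces in each word by comparing its length with and
-- without spaces, and list the rows that exceed the (approximate) cutoff of 6.
SELECT wordid, word
FROM korean_english
WHERE length(word) - length(replace(word, ' ', '')) > 6;

-- After checking the list by eye, the same condition can drive the delete:
-- DELETE FROM korean_english
-- WHERE length(word) - length(replace(word, ' ', '')) > 6;
```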


4.  STATS
=========
========== 2005 ??  ==================================

When I first started there were over 220,000 entries.

========== June 3rd, 2009  ==================================

how many words? 

mod=# select count(*) from korean_english ;
 count  
--------
 143795
(1 row)

mod=# 

Since January 2009 I've deleted, combined, added a few, and changed 7,168 into
examples by editing search results alone, to whittle it down to 143,795.

The following gives the count of words which have more than one hanja, as a
result of adding hanja to the dictionary.  (If there were multiple hanja for
one word, I simply put them all in the hanja field to be sorted out by hand
later, er, that is, now.)

mod=# select count(*) from korean_english where hanja like '%,%';
 count 
-------
 16416
(1 row)

mod=# 
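
To help with sorting these out by hand, the multi-hanja rows can be listed one
candidate per line.  A minimal sketch, assuming the comma is the only separator
used in the hanja field (which is what the LIKE '%,%' test above relies on):

```sql
-- Sketch only: emit one row per comma-separated hanja candidate.
-- regexp_split_to_table is a set-returning function (PostgreSQL 8.3 or later).
SELECT wordid,
       word,
       regexp_split_to_table(hanja, ',') AS single_hanja
FROM korean_english
WHERE hanja LIKE '%,%'
ORDER BY wordid;
```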

well, how many have clean single hanja entries?

mod=# select count(*) from korean_english where hanja NOT like '%,%';
 count 
-------
 24343
(1 row)

mod=# 
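
One caveat about the query above: NOT LIKE '%,%' skips rows where hanja is
NULL, but it still counts rows where hanja is an empty string.  Whether this
table has any such rows I can't say from the counts alone; if it does, a
stricter version would be something like:

```sql
-- Sketch only: count rows with a non-empty hanja value that has no comma.
-- Rows with NULL hanja are excluded by both conditions.
SELECT count(*)
FROM korean_english
WHERE hanja <> ''
  AND hanja NOT LIKE '%,%';
```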


How many words have duplicate definitions, a price to pay for reversing
engdic?

mod=# select  word , count(*) from korean_english group by (word) having
count(*) > 1 ;

......


(10581 rows)

But this number is really 6376, because many words legitimately appear twice:
they have two distinct definitions (after editing, as well).
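
Two variations that may be handy here, as a sketch only.  The first returns the
number of repeated words directly instead of listing them; the second narrows
the check to rows where the word and its definition repeat exactly, which is
only one possible way of telling true duplicates from words with two meanings.

```sql
-- Count words that occur in more than one row.
SELECT count(*) AS repeated_words
FROM (
    SELECT word
    FROM korean_english
    GROUP BY word
    HAVING count(*) > 1
) AS repeated;

-- Narrower check: the same word with exactly the same definition text.
SELECT word, def, count(*) AS copies
FROM korean_english
GROUP BY word, def
HAVING count(*) > 1;
```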

People have no idea how small this number is ~~ 



And how many entries need to be made into examples instead?  (words in the
dictionary don't contain spaces, those are examples)

mod=# select count(*) from korean_english where word like '% %';
 count 
-------
 50180
(1 row)

mod=#