I’m just going to leave this here.  When I moved to Amsterdam, I put together a quick scraping and spidering script to build a list of the most frequently used words in Dutch, so I could practice and build my vocabulary.  The thinking was that by starting with the highest-frequency words, I would learn the language in a demand-driven fashion and learn what matters first!  It’s got a few bells and whistles, like Google Translate links and context examples.  I figured I’d share it, since it comes up once in a while.

The Perl code is available as a GitHub gist.
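
The core idea is simple enough to sketch in a few lines. What follows is not the Perl script from the gist, just a minimal Python illustration of the counting step, assuming the page text has already been downloaded and stripped of markup. The function name `top_words` and the sample sentences are made up for the example.

```python
import re
from collections import Counter


def top_words(texts, n=100):
    """Return the n most frequent words across an iterable of plain-text pages."""
    counts = Counter()
    for text in texts:
        # Lowercase, then pull out runs of letters (including accented ones
        # such as the "é" in "één"); digits and underscores are excluded.
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    return counts.most_common(n)


if __name__ == "__main__":
    # Hypothetical stand-ins for scraped pages.
    pages = [
        "De kat zit op de mat.",
        "De hond en de kat spelen in het park.",
    ]
    for word, count in top_words(pages, n=3):
        print(word, count)
```

A small set of pages will over-represent whatever those pages happen to be about, so the more text you feed it, the more sensible the list should get.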

Here are the top 100 words in Dutch.  Run the script to get a larger sample.

I want to develop it further by creating a flash-card generator and making it auto-generate quizzes, since both of those methods are known to be good for language learning. It would also be nice to have it remember the words you’ve seen and learnt.

Also, a disclaimer: I’ve lived in the Netherlands for over 3 years and my Dutch is awful, so this approach to language learning does not have a good track record so far.


5 thoughts on “A do-it-yourself (Dutch) language course”

  1. B10m says:

    It starts off pretty well (as in “yeah, I believe this is a top 100 word”), but halfway through, your script focuses a little too much on Bente, the girl who is “crowned” “UNOX babe” (where UNOX is a sausage brand).

    Other than that: well done! Gonna use it for my language study as well 🙂

  2. admin says:

    Haha! I noticed the focus on the UNOX babe as well. I hope that if you try the script yourself and increase the number of pages it scrapes from 100 to 1000 or more, oddities like that will disappear.

  3. Interesting idea! You’ll understand I can always appreciate a data-driven approach to learning. 😉

    My father happens to be active in language acquisition research (focus on Dutch as a second language), and I was curious to know how well your findings would resemble some of his. So, I grabbed the list of words (just the words, nothing else) from your site, using the JS console.

    (It’s a bit of a hack; but seriously, you could’ve made that easier, man! :-p)

    // Dump the following lines into the JS console.
    var script = document.createElement('script');
    script.src = 'https://ajax.googleapis.com/ajax/libs/jquery/1.3/jquery.min.js';
    // Wait for jQuery to finish loading before using $.
    script.onload = function () {
      $("dt .title").each(function () {
        var m = $(this).text().match(/^(\w+)\W.+$/);
        if (m) console.log(m[1]); // skip titles that don't match
      });
    };
    document.body.insertBefore(script, document.body.firstChild);
    // Just the words should now be printed in the console in all-caps.

    Then, on Anne’s website I input the list of words into the MLR tool, which will analyse a given text and return an indication of the frequency of the words and concepts used as compared to a base sample. The data behind this was collected in Dutch classrooms (think teachers with tape recorders and lots and lots of typists). It should give you an idea of the words that Dutch kids (and their teachers) used in school at the time and their relation to the size of their vocabulary.

    Results seem interesting, but not very surprising.

    – Almost three quarters of the words you’ve found fall within “lijst 1”. These are indeed extremely common words.
    – MLR estimates you’d need a vocabulary of about 12,000 words to understand the whole “text” (note you do not need to know all the words to understand a whole body of text; there’s some mathemagic going on there to estimate that number).
    – Sixty words are completely unknown (mostly obvious things like “pp”, “kardashian” and “appstore”). These seem to be either loose letters or relatively new words.
    – The most difficult words (from “lijst 9”) are “aanmelden” (to register; on a site, I presume), “atleet” (athlete; probably because of the recent Olympics), “ban_N” (spellbound; very uncommon word which is presumably frequent in your set because of the Dutch title for the Lord of the Rings), “kim” (goes with “kardashian”, I guess) and “overleg” (consultation). All of these seem skewed towards the interwebs to me; this was to be expected.

    I am only skimming and massively oversimplifying the results here. You can run the analysis on the site yourself and read the paper linked on it if you want more details. Both site and paper are in Dutch though, so a bit of a catch-22 for you there. 😉

    Veel leesplezier!

  4. Rafael Garcia-Suarez says:

    Surprised to find Rafael in that list 🙂 you probably need bigger samples.

  5. admin says:

    That’s a seriously thorough comment, Lukas 🙂 Quite interestingly, a lot of the words that stand out to you all (Bente the UNOX babe, Kim Kardashian and “Rafael”) appear mainly because I seeded this list with Yahoo.nl in addition to Wikipedia, and limited it to downloading 250 pages. In the version I actually use myself, I used only nrc.nl and Wikipedia and seeded it with 1000 pages. That has a lot fewer of these anomalies! I had to change it because nrc.nl has since added some JavaScript stuff that didn’t take well to my scraping. Clearly Yahoo was not a great replacement!
