Reverse engineering visual novels 101

I’d like to make a confession: I love visual novels. For those of you who aren’t familiar with them, visual novels are something in between interactive books, games-that-mostly-consist-of-reading-lotsa-text, and radio plays with images. Needless to say, the vast majority of them come from Japan. For the last 5 years I’ve been totally into them, ignoring most other forms of entertainment media, like paper books, audio books or TV.

Since early childhood I’ve been an avid tinkerer: I just love to get inside things and figure out how they work. That attitude slowly got me a reverse engineer and malware analyst position in a well-known national company, so I thought: why don’t I combine these two hobbies and see what happens?

I’d had these thoughts for several years already, but everything changed when I encountered the Kaitai Struct framework this spring. It’s a new framework for reverse engineering binary structures (although the authors seem to insist that it’s purely for peaceful purposes :)). The basic idea is simple, yet catchy: you use a special markup language to declaratively describe the structure of binary data, then, voila, you compile that description with a special compiler and you’ve got yourself a parsing library in any of the supported programming languages. There is also a handy visualizer / hex viewer available. Shall we give it a spin?

To make it practical and demonstrate the typical stuff one encounters while doing basic reverse engineering work, I propose that we start with something not-so-trivial, namely a beautiful visual novel called Koisuru Shimai no Rokujuso (originally 恋する姉妹の六重奏) by PeasSoft. It’s a fun, easy-going romantic comedy with the splendid visuals that PeasSoft is usually famous for. As a reverse engineering target, it’s also a nice thing to explore, as it looks like we’ll be kind of breaking fresh ground. And that’s much more interesting than just duplicating the work of others on well-explored stuff like Kirikiri or Ren’Py.

Pre-flight check

First of all, let’s take a look at the list of stuff we’ll need:

Kaitai Struct: compiler, visualizer, and runtime library for your favorite programming language
Java runtime: unfortunately, the Kaitai Struct compiler is written in Scala and thus requires a JRE to run
Ruby: the Kaitai Struct visualizer is written in Ruby
any of the languages supported by KS (at the time of writing, these are C++, C#, Java, JavaScript, Ruby and Python); I’ll be using Ruby myself, but it’s not that important, as we’ll literally be writing just a few lines of code at the end of our exploration
some kind of hex editor; it’s not important which one, they all suck :) I use Okteta, but feel free to use anything you like, as long as it can do the basic stuff like viewing a hex dump, jumping to an address and quickly copying a selected fragment into a file

Getting ‘em ready

There we go. Now we need something to dissect, namely a visual novel distribution. We’re in luck here: PeasSoft legally offers trial versions of their visual novels for free download on their site. A trial version is more than enough for us to get acquainted with their formats. Look for a page on their site that looks like this:

Trial version download page

That blue list with numbers on the left is actually a pack of mirror links; you can use any one of them to download the distro.

Think-think-think

Having downloaded it, let’s start with the basic intelligence. What do we know already about our target? At least the version we’ve downloaded runs on Windows OS using Intel CPUs (actually there *is* an Android version that runs on ARMs, but it’s sold only at Japanese app markets, and it’s not so easy to extract from a phone). So, what can we assume given that it’s Windows/Intel?

integer numbers in binary formats will use little-endian encoding
Windows programmers still live in the stone age and use Shift-JIS encoding (and not UTF-8, which the rest of the world uses)
end-of-line markers, if we ever encounter them, will be “\r\n”, and not just “\n”

Quick overview of our loot shows us the following:

data01.ykc — 8,393,294
data02.ykc — 560,418,878
data03.ykc — 219,792,804
sextet.exe — 978,944
AVI/op.mpg — 122,152,964

So, here we have a single .exe file (obviously, the executable engine), a few enormous .ykc files (most likely content archives or containers) and op.mpg, an opening video (one can easily open it with any video player).

It can be useful to do a quick inspection of the exe file in a hex editor. Most modern developers are sane enough to have finally switched to popular ready-made libraries for image / music / sound processing, instead of inventing their own in-house compression formats. And all these libraries usually bear some signatures which can be easily spotted with the naked eye. Things to look for:

“libpng version 1.0.8 — July 24, 2000” — libpng usage, it means that images will be in .png format
“zlib version error”, “Unknown zlib error” — zlib markers, which mean that zlib compression will be employed; actually, it can be a false positive, as zlib compression is embedded in png library, so it might be just the part of libpng
“Xiph.Org libVorbis I 20020717” — libvorbis, it means that music / sounds / voices might use ogg/vorbis format
“Corrupt JPEG data”, “Premature end of JPEG file” —a few strings from libjpeg; if they’re here — it means that the chances are that the engine can work with JPEG images too
“D3DX8 Shader Assembler Version 0.91” — something inside the engine uses D3DX8 shaders
lots of strings like “Microsoft Visual C++ Runtime Library”, “`local vftable’”, “`eh vector constructor iterator’” and so on reveal that this exe is linked with Microsoft C++ library, thus it was originally written in C++; actually, one can even derive an exact version of compiler (and a library), but it would be of little use for us now — we’re not going to disassemble exe or anything, we’re still attempting to do a fair “clean room” job

It’s also a good idea to look for stuff like “version”, “copyright”, “compiler”, “engine”, “script”, and, given that it’s a Windows exe, don’t forget to look for them in 2-byte encodings like UTF-16LE; chances are that you’ll end up with something interesting. In our case, we’ve got “Yuka Compiler Error”, “YukaWindowClass”, “YukaApplicationWindowClass”, “YukaSystemRunning”, and mentions of “start.yks” and “system.ykg”. It would be a fair bet to say that the developers named their engine “Yuka” and that all the files related to it have extensions which start with “yk”: “ykc”, “yks”, “ykg”. We can also spot “CDPlayMode” and “CDPlayTime”, which probably mean that the engine can play music tracks from audio CDs, while “MIDIStop” and “MIDIPlay” suggest that there’s support for MIDI music too.

Various fancy signature strings inside .exe give us some bold hints
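If no hex editor is at hand, the same signature hunt can be roughly sketched in Ruby. This is only a crude stand-in for eyeballing the file: the helper name find_strings, the keyword list and the minimum length of 6 are my own choices for illustration, not anything provided by the engine or by Kaitai Struct.

```ruby
# Hypothetical helper (not part of any library): collect runs of printable ASCII,
# stored either as plain 1-byte characters or as UTF-16LE (each char followed by 0x00).
def find_strings(data, min_len = 6)
  data = data.b # work on raw bytes
  ascii = data.scan(/[\x20-\x7e]{#{min_len},}/n)
  utf16 = data.scan(/(?:[\x20-\x7e]\x00){#{min_len},}/n).map { |s| s.delete("\x00") }
  ascii + utf16
end

# Words worth looking for, as suggested in the text (plus the engine name we found).
KEYWORDS = /version|copyright|compiler|engine|script|yuka/i

# Usage: ruby find_strings.rb sextet.exe
ARGV.each do |fn|
  find_strings(File.binread(fn)).grep(KEYWORDS).uniq.each { |s| puts s }
end
```

Run without arguments it does nothing; pass the path of the exe to get a filtered list of candidate strings.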

To sum it all up, it means that:

images will probably be in .png and .jpeg formats, which is actually very good news: we won’t need to mess with custom compression or anything
music, sounds and voices will probably use .ogg files (they may also use MIDI or CDDA, though that’s unlikely: the trial version is a file download, not a physical CD)

Let’s get dangerous

Ok, time to roll up our sleeves and get our hands dirty. There are several files, and that is actually very good news too. If you do reverse engineering, having several specimens is a very valuable asset, as you can employ statistical methods, comparing these specimens to each other. It’s much harder to guess what that 7F 02 00 00 might mean if that’s the only thing you have.

Let’s check if these files all have the same format. Judging by the fact that they all follow the same “data*.ykc” pattern, they do. Checking the beginnings of the files reveals that they all start with “YKC001\0\0”, which proves us right.
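The signature check is easy to script when there are many files to go through. A minimal sketch, assuming only the 8-byte signature described above; the helper name ykc_magic? is made up for this example.

```ruby
require 'stringio'

YKC_MAGIC = "YKC001\0\0".b # 8 bytes: ASCII "YKC001" plus two zero bytes

# Hypothetical helper: true if the stream starts with the YKC signature.
def ykc_magic?(io)
  io.read(YKC_MAGIC.bytesize) == YKC_MAGIC
end

# In-memory demonstration, so no game files are needed to try it.
puts ykc_magic?(StringIO.new("YKC001\0\0\x18\0\0\0".b))
puts ykc_magic?(StringIO.new("RIFF....WAVE".b))

# Usage with real files: ruby check_magic.rb data*.ykc
ARGV.each { |fn| File.open(fn, 'rb') { |f| puts "#{fn}: #{ykc_magic?(f)}" } }
```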

Another quick check: let’s see if they are compressed. Just take any compressor and try to compress one file, then check compression ratio. I’ve just used zip:

before: 8,393,294 bytes
after: 6,758,313 bytes

Yeah, zip squeezes it a bit, but not by much. Chances are that the archive is uncompressed, or at least that some of the files inside are. In fact, if there are .pngs or .oggs in that archive, they are already well compressed, so this looks plausible.
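The same experiment can be repeated without an external archiver. The sketch below uses Ruby’s standard zlib binding instead of zip (so the exact numbers will differ a little), and the helper name deflate_ratio is my own. A ratio close to 1.0 hints at data that is already compressed; a much smaller one hints at plain data.

```ruby
require 'zlib'

# Ratio of deflated size to original size.
def deflate_ratio(data)
  return 0.0 if data.empty?
  Zlib::Deflate.deflate(data).bytesize.to_f / data.bytesize
end

# The ratio from the zip experiment above: 6,758,313 / 8,393,294
puts((6_758_313.0 / 8_393_294).round(3)) # → 0.805

# Usage: ruby ratio.rb data01.ykc data02.ykc data03.ykc
# (reads each file fully into memory, which is fine for a quick check)
ARGV.each { |fn| puts format('%s: %.3f', fn, deflate_ratio(File.binread(fn))) }
```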

Basic theory suggests that every archive has some sort of header, something an application starts reading the archive with, while 99% of the file is filled with the archive contents: either files or some blocks of data. Note that the header is not always at the very beginning of the file; it might be at some offset from the end of the file, or at some offset from the beginning. It’s highly unlikely, though, that the header would be right in the middle of the contents.

Checking the beginning of the files reveals:

There we go:

59 4B 43 30 │ 30 31 00 00 — it’s just a signature, some magic string to check that the file is actually a container; it’s “YKC001” in ASCII — chances are that it means something like “Yuka Container”, version 001
18 00 00 00 │ 00 00 00 00 │ 1A 00 80 00 │ 34 12 00 00 — that’s the header
and then we see the body of some kind of the file; it’s very easy to see in data02.ykc — archive starts with a text file, something like a config or some script language, sporting lines like “WindowSize = 1280, 800”, “TransitionWaitType = 2”, etc

Let’s compare the headers of all three files we have:

data01.ykc: 18 00 00 00 │ 00 00 00 00 │ 1A 00 80 00 │ 34 12 00 00
data02.ykc: 18 00 00 00 │ 00 00 00 00 │ 4E E1 66 21 │ F0 6E 00 00
data03.ykc: 18 00 00 00 │ 00 00 00 00 │ 5C E1 18 0D │ 48 E4 00 00

I don’t know what “18 00 00 00 │ 00 00 00 00” is yet, but it doesn’t really matter as it’s all the same everywhere. The other two fields are much more interesting — they look a lot like two 4-byte integers. So, it’s time to grab onto Kaitai Struct and start hacking away a rough sketch of our format:

meta:
  id: ykc
  application: Yuka Engine
  endian: le
seq:
  - id: magic
    contents: ["YKC001", 0, 0]
  - id: magic2
    contents: [0x18, 0, 0, 0, 0, 0, 0, 0]
  - id: unknown1
    type: u4
  - id: unknown2
    type: u4

Nothing too scary, is it? Actually, a .ksy file is just a plain YAML file. The “meta” section is made up of our early findings, i.e. that we’re working on “ykc” files, that the application which uses them is called “Yuka Engine” (it’s just a friendly reminder for fellow researchers, like a comment; it doesn’t really affect parsing) and that we’ll be using little-endian integers by default (that’s “endian: le”).

Then we have a “seq” section which describes how to parse the file: it’s a list of fields. Each field must have a name (“id”), which is exactly what we’re working for, and a description of its contents. We’ve used 2 clauses here:

“contents: [0x18, 0, 0, 0, 0, 0, 0, 0]” marks up a field with fixed contents. It means that during parsing KS automatically reads the right number of bytes and compares them with the given pattern. If they don’t match, an exception is thrown.
“type: u4” marks up an “u”nsigned integer field, 4 bytes long. It uses the default endianness, which we’ve specified in “meta”.
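What “type: u4” means for little-endian data can also be reproduced by hand. The sketch below decodes the last 8 bytes of each header shown earlier with Ruby’s String#unpack, where the 'V' directive stands for a 32-bit unsigned little-endian integer. The helper name u4le_pair is hypothetical.

```ruby
# The two unknown 4-byte fields from each header, copied from the dumps above.
HEADERS = {
  'data01.ykc' => '1A 00 80 00 34 12 00 00',
  'data02.ykc' => '4E E1 66 21 F0 6E 00 00',
  'data03.ykc' => '5C E1 18 0D 48 E4 00 00',
}

# Turn a hex dump into raw bytes, then read two little-endian u4 values from it.
def u4le_pair(hex)
  [hex.delete(' ')].pack('H*').unpack('V2')
end

HEADERS.each do |name, hex|
  unknown1, unknown2 = u4le_pair(hex)
  puts "#{name}: unknown1 = #{unknown1}, unknown2 = #{unknown2}"
end
```

The values printed here should agree with what the visualizer reports for the same fields.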

Let’s try our ksy file in a visualizer:

ksv data01.ykc ykc.ksy

and, lo and behold, our data01.ykc laid out nicely:

Kaitai Struct visualizer in all its console glory

Yeah, the visualizer is console based, so you can start feeling like a Real Hacker they show in the movies right about now. But let’s take a look at the tree structure first:

[-] [root]
  [.] @magic = 59 4b 43 30 30 31 00 00
  [.] @magic2 = 18 00 00 00 00 00 00 00
  [.] @unknown1 = 8388634
  [.] @unknown2 = 4660

One can use arrow keys in the visualizer to walk through all these fields, use “Tab” to jump to hex viewer and back and use “Enter” to open closed tree nodes, show the instances (we’ll talk about them later) and view the hex dumps full screen. To make article easier to read, I won’t be showing full screenshots of the visualizer, only the interesting parts of the tree as text.

Ok, that was data01.ykc, let’s check out data02.ykc:

[-] [root]
  [.] @magic = 59 4b 43 30 30 31 00 00
  [.] @magic2 = 18 00 00 00 00 00 00 00
  [.] @unknown1 = 560390478
  [.] @unknown2 = 28400

and data03.ykc:

[-] [root]
  [.] @magic = 59 4b 43 30 30 31 00 00
  [.] @magic2 = 18 00 00 00 00 00 00 00
  [.] @unknown1 = 219734364
  [.] @unknown2 = 58440

Not that much, actually. There’s not a file directory or anything. Let’s note the sizes of original container files and see if any of these could be offsets or pointers inside the file:

data01.ykc — 8393294 @unknown1 = 8388634
data02.ykc — 560418878 @unknown1 = 560390478
data03.ykc — 219792804 @unknown1 = 219734364
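A quick way to test the offset idea is to see how much of each file lies beyond @unknown1. The sketch below only does the subtraction on the numbers listed above; it doesn’t touch the actual files, and the constant name FILES is just for this example.

```ruby
# [file size, @unknown1] for each archive, as listed above.
FILES = {
  'data01.ykc' => [8_393_294, 8_388_634],
  'data02.ykc' => [560_418_878, 560_390_478],
  'data03.ykc' => [219_792_804, 219_734_364],
}

FILES.each do |name, (size, unknown1)|
  tail = size - unknown1 # bytes between that position and the end of the file
  puts format('%s: %d bytes after @unknown1 (%.4f%% of the file)', name, tail, 100.0 * tail / size)
end
```

In all three cases the position falls inside the file, very close to its end, which is what one would expect from a pointer to some trailing structure.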

Wow, looks like we’ve hit the bullseye. Let’s check out what’s happening at that offset in the file:

meta:
  id: ykc
  application: Yuka Engine
  endian: le
seq:
  - id: magic
    contents: ["YKC001", 0, 0]
  - id: magic2
    contents: [0x18, 0, 0, 0, 0, 0, 0, 0]
  - id: unknown_ofs
    type: u4
  - id: unknown2
    type: u4
instances:
  unknown3:
    pos: unknown_ofs
    size-eos: true

We’ve added another field named “unknown3”. However, this time it’s not in the “seq” section, but in the “instances” section. It’s actually the very same thing, but “instances” is used to describe fields which do not follow one another in sequence: they can be anywhere in the stream, so they need a position (“pos”) to start parsing from. So, our “unknown3” starts at “unknown_ofs” (“pos: unknown_ofs”) and spans up to the end of the file (= the stream, hence “size-eos: true”). As we have no idea what we’ll find there, for now it will be read just as a byte array. Nothing too fancy, but let’s take a look:

[-] [root]
  [.] @magic = 59 4b 43 30 30 31 00 00
  [.] @magic2 = 18 00 00 00 00 00 00 00
  [.] @unknown_ofs = 8388634
  [.] @unknown2 = 4660
  [-] unknown3 = 57 e7 7f 00 0a 00 00 00 18 00 00 00 13 02 00 00…

Hey, note the length of that “unknown3”: it is exactly the value of “unknown2”. So it turns out that the first header of a YKC file is not the real header, but a reference to some other point in the file where the real header can be found. Let’s fix our .ksy file to add this knowledge:

meta:
  id: ykc
  application: Yuka Engine
  endian: le
seq:
  - id: magic
    contents: ["YKC001", 0, 0]
  - id: magic2
    contents: [0x18, 0, 0, 0, 0, 0, 0, 0]
  - id: header_ofs
    type: u4
  - id: header_len
    type: u4
instances:
  header:
    pos: header_ofs
    size: header_len

Nothing too complex here: we’ve just renamed the “unknown” fields to give them more meaningful names and replaced “size-eos: true” (which means reading everything up to the end of the file) with “size: header_len” (which specifies the exact number of bytes to read). That’s probably pretty close to the original idea. Load it up once again, and now let’s focus on the field we’ve named “header”. In data01.ykc it looks something like this:

000000: 57 e7 7f 00 0a 00 00 00 18 00 00 00 13 02 00 00
000010: 00 00 00 00 61 e7 7f 00 0b 00 00 00 2b 02 00 00
000020: db 2a 00 00 00 00 00 00 6c e7 7f 00 11 00 00 00
000030: 06 2d 00 00 92 16 00 00 00 00 00 00 7d e7 7f 00

in data02.ykc:

000000: d1 2b 66 21 0c 00 00 00 18 00 00 00 5a 04 00 00
000010: 00 00 00 00 dd 2b 66 21 14 00 00 00 72 04 00 00
000020: 26 1a 00 00 00 00 00 00 f1 2b 66 21 16 00 00 00
000030: 98 1e 00 00 a8 32 00 00 00 00 00 00 07 2c 66 21

in data03.ykc:

000000: ec 30 17 0d 26 00 00 00 18 00 00 00 48 fd 00 00
000010: 00 00 00 00 12 31 17 0d 26 00 00 00 60 fd 00 00
000020: 0d 82 03 00 00 00 00 00 38 31 17 0d 26 00 00 00
000030: 6d 7f 04 00 d0 85 01 00 00 00 00 00 5e 31 17 0d

At first glance, it doesn’t make any sense at all. On second thought, though, a sequence of repeating bytes catches the eye: `e7 7f` in the first file, `2b 66` in the second, and `30 17` with `31 17` in the third. It looks very much like we’re dealing with fixed-length records here, 0x14 (20 decimal) bytes long. By the way, this hypothesis agrees with the header lengths in all three files too: 4660, 28400, and 58440 are all divisible by 20. Let’s give it a try:

meta:
  id: ykc
  application: Yuka Engine
  endian: le
seq:
  - id: magic
    contents: ["YKC001", 0, 0]
  - id: magic2
    contents: [0x18, 0, 0, 0, 0, 0, 0, 0]
  - id: header_ofs
    type: u4
  - id: header_len
    type: u4
instances:
  header:
    pos: header_ofs
    size: header_len
    type: header
types:
  header:
    seq:
      - id: entries
        size: 0x14
        repeat: eos

Check out what happened with “header” instance here. It’s still positioned at “header_ofs” and has size of “header_len” bytes, but it’s no longer a mere byte array. It has its own type, “type: header”. This means that we can specify a custom type and this type will be used to process the given field. Here we go, it’s right below, in that “types:” section. As you might have already guessed, actually “type” follows exactly the same format as the main file (i.e. root) — one can use the same “seq” section to specify the sequence of subfields, “instances”, it can have its own subtypes (“types”), etc, etc.

So, we’ve specified a “header” type, which consists of a single field, “entries”. We know that this field needs to be 0x14 bytes long (“size: 0x14”), and we ask for it to be repeated as many times as possible (i.e. up to the end of the stream; that’s “repeat: eos”).

By the way, note that “stream” here is not the same as before, when “stream” actually meant “the whole file”. This time we’re dealing with a substructure that has a fixed size (“size: header_len”), so the repetition will be limited by that size anyway. We can rest assured that if there is something beyond that length, it won’t contaminate this structure of ours.

Ok, let’s give it a try:

[-] header
  [-] @entries (233 = 0xe9 entries)
    [.] 0 = 57 e7 7f 00|0a 00 00 00|18 00 00 00|13 02 00 00|00 00 00 00
    [.] 1 = 61 e7 7f 00|0b 00 00 00|2b 02 00 00|db 2a 00 00|00 00 00 00
    [.] 2 = 6c e7 7f 00|11 00 00 00|06 2d 00 00|92 16 00 00|00 00 00 00
    [.] 3 = 7d e7 7f 00|14 00 00 00|98 43 00 00|69 25 00 00|00 00 00 00
    [.] 4 = 91 e7 7f 00|15 00 00 00|01 69 00 00|d7 12 00 00|00 00 00 00
    [.] 5 = a6 e7 7f 00|12 00 00 00|d8 7b 00 00|27 3f 07 00|00 00 00 00

Now it starts to make some sense, doesn’t it? It really looks like a repeated structure. Let’s check out the second file too:

[-] header
  [-] @entries (1420 = 0x58c entries)
    [.] 0 = d1 2b 66 21|0c 00 00 00|18 00 00 00|5a 04 00 00|00 00 00 00
    [.] 1 = dd 2b 66 21|14 00 00 00|72 04 00 00|26 1a 00 00|00 00 00 00
    [.] 2 = f1 2b 66 21|16 00 00 00|98 1e 00 00|a8 32 00 00|00 00 00 00
    [.] 3 = 07 2c 66 21|16 00 00 00|40 51 00 00|a2 16 00 00|00 00 00 00
    [.] 4 = 1d 2c 66 21|16 00 00 00|e2 67 00 00|89 c4 00 00|00 00 00 00
    [.] 5 = 33 2c 66 21|16 00 00 00|6b 2c 01 00|fa f5 00 00|00 00 00 00

By the way, do you see that (233 = 0xe9 entries) and (1420 = 0x58c entries)? It’s plausible that this is the number of files in the archive. Our first archive is relatively small (8 MiB); dividing it by 233 files yields 36022 bytes per file on average. That looks legit for a bunch of scripts, configs, etc. The second archive is the largest (about 534 MiB); with 1420 files it yields 394661 bytes per file, which looks ok for stuff like images, voice files, etc.
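The arithmetic behind these numbers is simple enough to double-check in a few lines. This sketch assumes the 20-byte record size guessed earlier and uses the archive sizes and header lengths quoted above; the constant names are mine.

```ruby
ENTRY_SIZE = 20 # 0x14 bytes per record, as guessed from the dumps

# [archive size, header_len] for the two archives discussed above
ARCHIVES = {
  'data01.ykc' => [8_393_294, 4_660],
  'data02.ykc' => [560_418_878, 28_400],
}

ARCHIVES.each do |name, (size, header_len)|
  count, rest = header_len.divmod(ENTRY_SIZE) # rest == 0 means the header splits evenly
  puts "#{name}: #{count} entries (remainder #{rest}), about #{size / count} bytes per file"
end
```

A zero remainder for every archive is what makes the fixed-size record hypothesis believable; a single non-zero remainder would have been enough to discard it.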

`57 e7 7f 00`, `61 e7 7f 00`, `6c e7 7f 00` and so on look very much like an increasing sequence of integers. What could it mean? In the second file it’s `d1 2b 66 21`, `dd 2b 66 21`, `f1 2b 66 21`. Hang on a sec, I think I’ve seen this somewhere already. Let’s roll back to the beginning of our work. That’s it! These values are close to the full length of the file, so they look like offsets yet again. Ok, let’s try to describe the structure of these 20-byte records. Judging by their looks, I’d say that these are 5 integers. We’ll describe another type named “file_entry”. Giving full listings is becoming a bother, so if you’ll excuse me, I won’t copy-paste the whole file from now on and will just show you the changed “types” section:

types:
  header:
    seq:
      - id: entries
        repeat: eos
        type: file_entry
  file_entry:
    seq:
      - id: unknown_ofs
        type: u4
      - id: unknown2
        type: u4
      - id: unknown3
        type: u4
      - id: unknown4
        type: u4
      - id: unknown5
        type: u4

No new .ksy features tackled here. We’ve added “type: file_entry” for entries and described this subtype as 5 sequential u4 integers. Checking it out in visualizer:

[-] header
  [-] @entries (233 = 0xe9 entries)
    [-] 0
      [.] @unknown_ofs = 8382295
      [.] @unknown2 = 10
      [.] @unknown3 = 24
      [.] @unknown4 = 531
      [.] @unknown5 = 0
    [-] 1
      [.] @unknown_ofs = 8382305
      [.] @unknown2 = 11
      [.] @unknown3 = 555
      [.] @unknown4 = 10971
      [.] @unknown5 = 0
    [-] 2
      [.] @unknown_ofs = 8382316
      [.] @unknown2 = 17
      [.] @unknown3 = 11526
      [.] @unknown4 = 5778
      [.] @unknown5 = 0

Any thoughts? Here’s another idea: “unknown3” is a pointer to the beginning of a file body in our archive, and “unknown4” is most likely the length of that file. The reasoning is simple: 24 + 531 = 555, and 555 + 10971 = 11526, so these are just files that follow one another in the container. The same holds for unknown_ofs and unknown2: 8382295 + 10 = 8382305, and 8382305 + 11 = 8382316. That means that “unknown2” is the length of some other subrecord which begins at the “unknown_ofs” offset. “unknown5” always seems to be equal to 0.
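This kind of “each block ends where the next one starts” check is tedious to do by eye for hundreds of entries, so here is a small sketch that does it for the first three records of data01.ykc. The bytes are copied from the visualizer dump above; the helper name contiguous? is made up for this example.

```ruby
# First three 20-byte records of data01.ykc, copied from the dump above.
RECORDS_HEX = [
  '57 e7 7f 00 0a 00 00 00 18 00 00 00 13 02 00 00 00 00 00 00',
  '61 e7 7f 00 0b 00 00 00 2b 02 00 00 db 2a 00 00 00 00 00 00',
  '6c e7 7f 00 11 00 00 00 06 2d 00 00 92 16 00 00 00 00 00 00',
]

# Each record is read as five little-endian u4 values.
ENTRIES = RECORDS_HEX.map { |hex| [hex.delete(' ')].pack('H*').unpack('V5') }

# True if every (offset, length) pair ends exactly where the next pair starts.
def contiguous?(pairs)
  pairs.each_cons(2).all? { |(ofs, len), (next_ofs, _)| ofs + len == next_ofs }
end

puts contiguous?(ENTRIES.map { |e| e[0, 2] }) # unknown_ofs / unknown2
puts contiguous?(ENTRIES.map { |e| e[2, 2] }) # unknown3 / unknown4
```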

Come on, let’s add some special magic into “file_entry” to read these blocks of data, i.e. record at (unknown_ofs; unknown2) and file body at (unknown3; unknown4). It would look like that:

file_entry:
  seq:
    - id: unknown_ofs
      type: u4
    - id: unknown_len
      type: u4
    - id: body_ofs
      type: u4
    - id: body_len
      type: u4
    - id: unknown5
      type: u4
  instances:
    unknown:
      pos: unknown_ofs
      size: unknown_len
      io: _root._io
    body:
      pos: body_ofs
      size: body_len
      io: _root._io

Actually, we’ve done that trick with “instances” before, so it’s not that new. The only real new “magic” thing here is that “io: _root._io” specification. What does it do?

Do you remember when I mentioned that KS has a concept of a “stream” that’s being read, and that if you limit that “stream” by offset and size while parsing a substructure, it’s no longer the same “stream” as the whole file we had at the very beginning? That’s the case here. Without this “io” specification, “pos: body_ofs” would try to seek to the “body_ofs” position in the stream that corresponds to our “file_entry” record, which is only 20 bytes long, and that’s not what we want (not to mention that it would result in an error). So we need some special magic to specify that we want to read not from the current IO stream, but from the IO stream that corresponds to the whole file: that’s “_root._io”.
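The difference between the two streams can be imitated with plain Ruby StringIO objects. This is only an analogy, not how the KS runtime is implemented: read_at is a hypothetical helper, the data is a toy example, and where this sketch simply gets no data back, KS reports an error as noted above.

```ruby
require 'stringio'

# Hypothetical helper: roughly what "pos: ofs" plus "size: len" does on a given stream.
def read_at(io, ofs, len)
  io.seek(ofs)
  io.read(len)
end

# Toy "whole file": 24 filler bytes followed by a 5-byte body at offset 24.
root_io = StringIO.new(("\0" * 24 + 'HELLO').b)

# A 20-byte file entry saying "body at offset 24, 5 bytes long", wrapped in its
# own stream, the way a size-limited substructure gets its own stream.
entry_io = StringIO.new([0, 0, 24, 5, 0].pack('V5'))
_, _, body_ofs, body_len, _ = entry_io.read(20).unpack('V5')

p read_at(entry_io, body_ofs, body_len) # past the end of the 20-byte substream: no data
p read_at(root_io, body_ofs, body_len)  # the root stream holds the actual bytes
```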

Ok, what have we got with all that?

[-] @entries (233 = 0xe9 entries)
  [-] 0
    [.] @unknown_ofs = 8382295
    [.] @unknown_len = 10
    [.] @body_ofs = 24
    [.] @body_len = 531
    [.] @unknown5 = 0
    [-] unknown = 73 74 61 72 74 2e 79 6b 73 00
    [-] body = 59 4b 53 30 30 31 01 00 30 00 00 00…
  [-] 1
    [.] @unknown_ofs = 8382305
    [.] @unknown_len = 11
    [.] @body_ofs = 555
    [.] @body_len = 10971
    [.] @unknown5 = 0
    [-] unknown = 73 79 73 74 65 6d 2e 79 6b 67 00
    [-] body = 59 4b 47 30 30 30 00 00 40 00 00 00…

It’s easier to check this out in the interactive visualizer, but even in this static shot it’s easy to tell that `73 74 61 72 74 2e 79 6b 73 00` is an ASCII string. Checking its string representation, it turns out to be “start.yks” with a trailing zero byte. And `73 79 73 74 65 6d 2e 79 6b 67 00` is actually “system.ykg”. Bingo, these are the file names. And we’re damn sure that they are strings, not just some bytes. Let’s mark them up:

file_entry:
  seq:
    - id: filename_ofs
      type: u4
    - id: filename_len
      type: u4
    - id: body_ofs
      type: u4
    - id: body_len
      type: u4
    - id: unknown5
      type: u4
  instances:
    filename:
      pos: filename_ofs
      size: filename_len
      type: str
      encoding: ASCII
      io: _root._io
    body:
      pos: body_ofs
      size: body_len
      io: _root._io

The new stuff here is that “type: str” — it means that the bytes we’ve captured must be interpreted as a string — and “encoding: ASCII” specifies encoding (we’re not really sure, but so far it’s been ASCII). Visualizer again:

[-] header
  [-] @entries (233 = 0xe9 entries)
    [-] 0
      [.] @filename_ofs = 8382295
      [.] @filename_len = 10
      [.] @body_ofs = 24
      [.] @body_len = 531
      [.] @unknown5 = 0
      [-] filename = "start.yks\x00"
      [-] body = 59 4b 53 30 30 31 01 00 30 00 00 00…
    [-] 1
      [.] @filename_ofs = 8382305
      [.] @filename_len = 11
      [.] @body_ofs = 555
      [.] @body_len = 10971
      [.] @unknown5 = 0
      [-] filename = "system.ykg\x00"
      [-] body = 59 4b 47 30 30 30 00 00 40 00 00 00…
    [-] 2
      [.] @filename_ofs = 8382316
      [.] @filename_len = 17
      [.] @body_ofs = 11526
      [.] @body_len = 5778
      [.] @unknown5 = 0
      [-] filename = "SYSTEM\\black.PNG\x00"
      [-] body = 89 50 4e 47 0d 0a 1a 0a 00 00 00 0d…

Now, isn’t that nice? Looks like a job well done to me. You can even select individual file bodies, press “w” in the visualizer, type some name and export these binary blocks as local files. But that’s tiresome, and it’s not exactly what we wanted: we wanted to extract all the files at once, keeping their original filenames.
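To get a feeling for what the format description spares us, here is roughly what a hand-written reader for the same layout could look like in plain Ruby. It’s a sketch based only on what we’ve figured out so far; parse_ykc, build_ykc and YkcEntry are names I made up, and the tiny archive is synthetic, built just so the reader can be tried without the game files.

```ruby
require 'stringio'

YkcEntry = Struct.new(:filename, :body)

# Hand-written reader mirroring the .ksy above (hypothetical, for comparison only).
def parse_ykc(io)
  raise ArgumentError, 'not a YKC001 container' unless io.read(8) == "YKC001\0\0".b
  io.read(8)                                       # magic2, meaning still unknown
  header_ofs, header_len = io.read(8).unpack('V2') # two little-endian u4 values

  io.seek(header_ofs)
  header = io.read(header_len)

  # Walk the header in 20-byte steps; every offset is relative to the whole file.
  (0...header.bytesize).step(20).map do |i|
    name_ofs, name_len, body_ofs, body_len, _unknown5 = header.byteslice(i, 20).unpack('V5')
    io.seek(name_ofs)
    filename = io.read(name_len)
    io.seek(body_ofs)
    YkcEntry.new(filename, io.read(body_len))
  end
end

# Builds a synthetic archive with the observed layout: bodies, then names, then header.
def build_ykc(files)
  out = "YKC001\0\0".b + [0x18, 0].pack('V2') + ("\0" * 8).b
  append = lambda do |bytes|
    ofs = out.bytesize
    out << bytes
    [ofs, bytes.bytesize]
  end
  bodies = files.values.map { |body| append.call(body.b) }
  names  = files.keys.map { |name| append.call(name.b + "\0".b) }
  header_ofs = out.bytesize
  names.zip(bodies).each { |(n_ofs, n_len), (b_ofs, b_len)| out << [n_ofs, n_len, b_ofs, b_len, 0].pack('V5') }
  out[16, 8] = [header_ofs, out.bytesize - header_ofs].pack('V2') # patch the pointer to the header
  out
end

demo = build_ykc('start.yks' => 'YKS001', 'SYSTEM\\black.PNG' => "\x89PNG")
parse_ykc(StringIO.new(demo)).each { |e| puts "#{e.filename.inspect}: #{e.body.bytesize} bytes" }
```

Every seek and every unpack format string here is something that can silently go out of sync with our understanding of the format; in the .ksy version, the same knowledge lives in one declarative place.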

Showdown time

Let’s make a script for that. What do we do to transform our format description into code? That’s where Kaitai Struct shines: you don’t need to retype all those type specifications as code manually. You just get the ksc compiler and run:

ksc -t ruby ykc.ksy

and you’ve got yourself a nice and shiny “ykc.rb” file in your current folder, which is a library that you can plug in and use straight away. Ok, but how do we do that? Let’s start with something simple, like listing files to the screen:

require_relative 'ykc'
Ykc.from_file('data01.ykc').header.entries.each { |f|
  puts f.filename
}

Cool, huh? Here we go — two lines of code (four, if you count in “require” and block termination) — and we’ve got that huge listing pumping:

start.yks
system.ykg
SYSTEM\black.PNG
SYSTEM\bt_click.ogg
SYSTEM\bt_select.ogg
SYSTEM\config.yks
SYSTEM\Confirmation.yks
SYSTEM\confirmation_load.png
SYSTEM\confirmation_no.ykg
SYSTEM\confirmation_no_load.ykg
...

Let’s go through what’s going on here step-by-step:

Ykc.from_file(…): creates a new object of the Ykc class (which is generated from our .ksy description) by parsing a file from the local filesystem; the fields of this object are filled with whatever is described in the .ksy
.header: selects the “header” field in Ykc, thus returning an instance of the Ykc::Header class, which corresponds to the “header” type in the .ksy
.entries: selects the “entries” field in the header, returning an array of instances of the Ykc::FileEntry class
.each { |f| … }: the typical Ruby way to do something with each element of a collection
puts f.filename: just outputs the string in the “filename” field of a FileEntry to stdout, that is, the screen

It shouldn’t be very hard to write mass extraction script, but I just want to note a couple of things before that:

There are path specifications in the “filename” field, and they use “\” (backslash) as a folder separator, because the archive was originally created on a Windows system. If we attempt to create such a path on a UNIX system, it will obediently create a directory with backslashes in its name, so it’s a good idea to convert these “\” into “/” before calls like mkdir_p.
File names are actually zero-terminated (yeah, it looks like C, alright). That’s invisible when you just dump them on the screen, but it may become a problem when you try to create a file with a “\0” in its name.
If you look a bit further into the listing, you’ll encounter stuff like this:

"SE\\00050_\x93d\x98b\x82P.ogg\x00"
"SE\\00080_\x83J\x81[\x83e\x83\x93.ogg\x00"
"SE\\00090_\x83`\x83\x83\x83C\x83\x80.ogg\x00"
"SE\\00130_\x83h\x83\x93\x83K\x83`\x83\x83\x82Q.ogg\x00"
"SE\\00160_\x91\x96\x82\xE8\x8B\x8E\x82\xE9\x82Q.ogg\x00"

Do you remember the beginning of the article, when I said that Windows programmers still use Shift-JIS? That’s exactly it: the developers use file names with Japanese characters in them. Let’s change “encoding: ASCII” to “encoding: SJIS” in our filename description for that. Don’t forget to recompile ksy → rb, and, voila:

SE\00050_電話１.ogg
SE\00080_カーテン.ogg
SE\00090_チャイム.ogg
SE\00130_ドンガチャ２.ogg
SE\00160_走り去る２.ogg

Even if you don’t read Japanese, you can use something like Google Translate to see that 電話 is actually “phone”, so chances are that “SE\00050” is the sound of a phone ringing.
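What the encoding switch does under the hood can be tried on a single name in plain Ruby. A minimal sketch, using the raw bytes of the first file name from the dump above; decode_sjis is a hypothetical helper, and 'SJIS' is simply the same encoding name we put into the .ksy.

```ruby
# Raw file name bytes as stored in the archive (copied from the dump above).
RAW_NAME = "SE\\00050_\x93d\x98b\x82P.ogg\x00".b

# Interpret the bytes as Shift-JIS, then convert to UTF-8 for display.
def decode_sjis(bytes)
  bytes.dup.force_encoding('SJIS').encode('UTF-8')
end

# As ASCII these bytes are simply invalid, which is why the dump showed \x escapes.
puts RAW_NAME.dup.force_encoding('US-ASCII').valid_encoding?

# Decoded, with the trailing zero byte removed for printing.
puts decode_sjis(RAW_NAME).delete("\0")
```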

Ultimately, our extraction script will look like this:

require 'fileutils'
require_relative 'ykc'

EXTRACT_PATH = 'extracted'

ARGV.each { |ykc_fn|
  Ykc.from_file(ykc_fn).header.entries.each { |f|
    # strip the trailing zero byte, convert to UTF-8, use "/" as the separator
    filename = f.filename.strip.encode('UTF-8').gsub("\\", '/')
    dirname = File.dirname(filename)
    FileUtils::mkdir_p("#{EXTRACT_PATH}/#{dirname}")
    # binwrite avoids newline translation of binary data on Windows
    File.binwrite("#{EXTRACT_PATH}/#{filename}", f.body)
  }
}

That’s a bit more than 2 lines, but nothing fancy goes on here either. We grab a list of command line arguments (this way, you can run it using something like ./extract-ykc *.ykc), and, again, for every container file we’re iterating over all file entries. We clean up the file name (stripping trailing zero, encoding it to UTF-8 and replacing backslashes with forward slashes), derive directory name (dirname), create the folder if it doesn’t exist (mkdir_p) and, finally, we dump the `f.body` contents there to a file.

Our job is complete. You can run the script to see what we get. As predicted, the images really are in .png format (and you can view them with any image viewer), and the music and sounds are in .ogg (so you can listen to them with any player). For example, here are the backgrounds that we’ve got in the BG folder:

BG folder unpacked: backgrounds

And TA folder contains sprites which are overlaid over these backgrounds. For example, Mika looks like that:

TA/MIKA folder unpacked: sprites for Mika character

I can give out a little secret: in many Japanese VNs, “standing” sprites are named “ta” or “tati” / “tachi” / “tatsu” / “tatte”. That’s because the Japanese word for “standing” is 立って (tatte), and “to stand up” is 立ち上がる (tachiagaru). They usually contrast with “fa” or “face” sprites, which are used for avatar portraits in the dialogue text box and show only the character’s face.

That’s it for today. The two major things left to reverse engineer here are “yks” and “ykg” files — probably “yks” is the script of the game, and “ykg” are some aux graphics or animations. Let’s try to tackle them next time.

A few conclusions I’d like to share:

Kaitai Struct is a really nice tool and closely matches whatever you were doing without it, thus saving you lots of time. If you’re doing this in plain Ruby (or Python, or PHP, or whatever), at the very least you’ll have to do the whole thing twice: first, you write a script that outputs dumps to the screen, then you rewrite it to actually extract the data. If you’re employing “advanced” hex editors like Hexinator or 010 Editor, chances are you’d actually do this work thrice (one more time to write a template in your editor).
The Kaitai Struct visualizer is a very bare-bones tool; it might be ugly and slow, but it’s still the best thing for the job I’ve encountered so far. I hope it will get a facelift someday :)
You don’t even need a hex editor / viewer once you’ve done the first rough sketch in .ksy. Walking through the tree in the visualizer beats manual calculation of offsets anyway. But it’s still useful to have one for preliminary analysis and quick searches in the .exe.
.ksy is a very expressive language (it beats “advanced” hex editor templates, hands down), but the documentation is, well, somewhat lacking. I must confess that I kept pestering the KS author with an endless stream of questions for a week while writing this article. That trick with “io: _root._io” is not something you’d come up with by yourself. I can only hope that the documentation will improve as Kaitai Struct approaches v1.0.

A few useful links:

Kaitai Struct — http://kaitai.io/
PeasSoft — http://peassoft.com/ (warning, NSFW!)
One can find specifications and tools for a few visual novel engines (in .ksy format too) at my GitHub projects.
You can also read this article in Russian.