invalid byte sequence in UTF-8 (ArgumentError)


derek-jones
I am getting the error message below and want to find out where the invalid byte sequence is, so I can fix it.

The obvious solution is to trace the input lines being read, by printing them out as they are read.
Neither the --trace option nor the --verbose option has this effect.

Does anybody have any suggestions for locating the offending bytes in my AsciiDoc input file?
I am running the latest release, downloaded from GitHub today.

Obviously this is also a bug in Asciidoctor; it should either complain about the byte or handle it.

/usr1/expsrc/asciidoctor-master/bin> ./asciidoctor /usr1/rbook/T/twords/rbook.txt --trace --verbose --safe-mode secure
/usr1/expsrc/asciidoctor-master/lib/asciidoctor/parser.rb:726:in `=~': invalid byte sequence in UTF-8 (ArgumentError)
        from /usr1/expsrc/asciidoctor-master/lib/asciidoctor/parser.rb:726:in `block in next_block'
        from /usr1/expsrc/asciidoctor-master/lib/asciidoctor/reader.rb:454:in `read_lines_until'
        from /usr1/expsrc/asciidoctor-master/lib/asciidoctor/parser.rb:716:in `next_block'
        from /usr1/expsrc/asciidoctor-master/lib/asciidoctor/parser.rb:303:in `next_section'
        from /usr1/expsrc/asciidoctor-master/lib/asciidoctor/parser.rb:291:in `next_section'
        from /usr1/expsrc/asciidoctor-master/lib/asciidoctor/parser.rb:291:in `next_section'
        from /usr1/expsrc/asciidoctor-master/lib/asciidoctor/parser.rb:52:in `parse'
        from /usr1/expsrc/asciidoctor-master/lib/asciidoctor/document.rb:448:in `parse'
        from /usr1/expsrc/asciidoctor-master/lib/asciidoctor.rb:1337:in `load'
        from /usr1/expsrc/asciidoctor-master/lib/asciidoctor.rb:1415:in `convert'
        from /usr1/expsrc/asciidoctor-master/lib/asciidoctor/cli/invoker.rb:93:in `block in invoke!'
        from /usr1/expsrc/asciidoctor-master/lib/asciidoctor/cli/invoker.rb:85:in `each'
        from /usr1/expsrc/asciidoctor-master/lib/asciidoctor/cli/invoker.rb:85:in `invoke!'
        from ./asciidoctor:10:in `<main>'
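
One way to locate the offending bytes without help from Asciidoctor is to scan the input as raw bytes and check each line separately. The following is a minimal sketch, not part of Asciidoctor; the method name `invalid_utf8_lines` and the script name in the comment are made up for illustration.

```ruby
# Scan a file as raw bytes and report each line that is not valid UTF-8.
# Returns an array of [line_number, byte_offset_in_line, inspected_raw_line].
def invalid_utf8_lines(path)
  findings = []
  File.open(path, 'rb') do |io|
    io.each_line.with_index(1) do |raw, lineno|
      line = raw.dup.force_encoding(Encoding::UTF_8)
      next if line.valid_encoding?

      # An invalid byte is yielded by each_char as its own one-byte
      # "character", so the running byte count is the offset of the
      # first bad byte within the line.
      offset = 0
      line.each_char do |ch|
        break unless ch.valid_encoding?
        offset += ch.bytesize
      end
      findings << [lineno, offset, raw.inspect]
    end
  end
  findings
end

# Command-line use (hypothetical script name):
#   ruby find_bad_utf8.rb rbook.txt
if __FILE__ == $PROGRAM_NAME && ARGV[0]
  invalid_utf8_lines(ARGV[0]).each do |lineno, offset, shown|
    puts "line #{lineno}, byte #{offset}: #{shown}"
  end
end
```

Reading in binary mode keeps Ruby from applying any default encoding, so the check does not depend on the system locale.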


Re: invalid byte sequence in UTF-8 (ArgumentError)

Chris
With the latest Asciidoctor v1.5.0 I get the same error, "invalid byte sequence in UTF-8", when I use German umlauts like "äüö" in my .adoc files.

Asciidoctor does not compile this very basic Example.adoc and emits the error message above.
So there's no way to compile any AsciiDoc file with German umlauts to HTML. That's really annoying.

My system is Windows 7 64-bit, German version, with Ruby v1.9.3p545.

Re: invalid byte sequence in UTF-8 (ArgumentError)

LightGuardjp
Do you have a test document, an actual file rather than just the contents?

On Sunday, August 24, 2014, Chris [via Asciidoctor :: Discussion] <[hidden email]> wrote:
With the latest Asciidoctor v1.5.0 I get the same error, "invalid byte sequence in UTF-8", when I use German umlauts like "äüö" in my .adoc files.

Example.adoc:

Test
äöü

Asciidoctor does not compile this very basic Example.adoc and emits the error message above.
So there's no way to compile any AsciiDoc file with German umlauts to HTML. That's really annoying.

My system is Windows 7 64-bit, German version.



Re: invalid byte sequence in UTF-8 (ArgumentError)

Chris
Hi Jason,

Sublime Text 3:
http://www.file-upload.net/download-9424948/sublime_text.adoc.html
sublime_text.adoc -> asciidoctor v1.5.0 error message: "incompatible character encodings: UTF-8 and US-ASCII"

Windows 7 Editor (Notepad):
http://www.file-upload.net/download-9424949/windows_notepad.adoc.html
windows_notepad.adoc -> asciidoctor v1.5.0 error message: "invalid byte sequence in UTF-8"

Both didn't compile to html.
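
For reference, both messages can be reproduced in plain Ruby with a single byte. This is only a sketch of how each error can arise; the thread does not establish that this is exactly what happened inside Asciidoctor for the two files, and the variable names are illustrative.

```ruby
# The single byte 0xE4 is "ä" in Windows-1252 / ISO-8859-1, but on its own
# it is not a valid UTF-8 sequence.
legacy_bytes = "\xE4".b

# Case 1: bytes tagged as UTF-8 that are not valid UTF-8.
# A regular expression match on such a string raises ArgumentError.
as_utf8 = legacy_bytes.dup.force_encoding(Encoding::UTF_8)
regex_error =
  begin
    as_utf8 =~ /x/
    nil
  rescue ArgumentError => e
    e
  end

# Case 2: a UTF-8 string with non-ASCII content combined with a string that
# is tagged US-ASCII but contains a high byte.
utf8_text = "\u00E4"
as_ascii = legacy_bytes.dup.force_encoding(Encoding::US_ASCII)
concat_error =
  begin
    utf8_text + as_ascii
    nil
  rescue Encoding::CompatibilityError => e
    e
  end

puts "regex match:   #{regex_error.class}: #{regex_error.message}" if regex_error
puts "concatenation: #{concat_error.class}: #{concat_error.message}" if concat_error
```

In both cases the bytes are the same; only the encoding label Ruby attaches to the string differs.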

Re: invalid byte sequence in UTF-8 (ArgumentError)

LightGuardjp
Huh. Not seeing the links. Oh well. Anyway, which version of Ruby are you using? The second error is definitely caused by the wrong BOM at the start of the file. As for the first error, it's odd that you are able to have two encodings in the same file.
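
Whether a file starts with a BOM can be checked by looking at its first few bytes. A minimal sketch follows; the names `BOMS` and `detect_bom` and the script name in the comment are made up for illustration.

```ruby
# Known byte order marks. The four-byte marks come first so that UTF-32LE
# (FF FE 00 00) is not mistaken for UTF-16LE (FF FE).
BOMS = [
  ['UTF-32BE', "\x00\x00\xFE\xFF".b],
  ['UTF-32LE', "\xFF\xFE\x00\x00".b],
  ['UTF-8',    "\xEF\xBB\xBF".b],
  ['UTF-16BE', "\xFE\xFF".b],
  ['UTF-16LE', "\xFF\xFE".b]
].freeze

# Return the name of the BOM at the start of the file, or nil if none is found.
def detect_bom(path)
  head = File.open(path, 'rb') { |io| io.read(4) } || ''.b
  match = BOMS.find { |_name, bytes| head.start_with?(bytes) }
  match && match[0]
end

# Command-line use (hypothetical script name):
#   ruby detect_bom.rb windows_notepad.adoc
if __FILE__ == $PROGRAM_NAME && ARGV[0]
  puts(detect_bom(ARGV[0]) || 'no BOM found')
end
```

A file with no BOM can still be in a legacy encoding, so a `nil` result says nothing about whether the content is valid UTF-8.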


Re: invalid byte sequence in UTF-8 (ArgumentError)

mojavelinux
Administrator
In reply to this post by Chris
I want to be absolutely clear, because there's a lot of potential confusion around this subject. Asciidoctor fully supports UTF-8 and thus the entire set of characters defined by the Unicode specification (in other words, all characters).

When Asciidoctor has problems processing documents, it's a problem that is inherited from the misunderstanding between Ruby and the operating system.

We are at a point in global technology where all systems should be using UTF-8 (or UTF-16) by default. Linux has supported this mode for nearly a decade, if not more. Unfortunately, Windows seems to be stubborn about this topic and insists on defaulting to regional charsets. Since you're using a German version of Windows, and you are getting this error, I'm fairly certain your system charset is not configured as UTF-8.

Unfortunately, there's no (easy) way for Ruby to know that it's not getting a UTF-8 document, or that the system is not in UTF-8. To make matters more complicated, Ruby seems to be configured differently on Windows than it is on other operating systems. It's extremely rare that you would see this error on Linux, if at all.

To move forward, what we need to understand is what flags need to be set to get all the parts playing in the same UTF-8 sandbox. To start, make sure you save your text files using UTF-8 encoding. I think it's bad practice in the modern era to save text files any other way, thus I want to stay away from input / output encoding settings in Asciidoctor.
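
If an editor has already saved a file in a legacy encoding, it can be re-encoded to UTF-8 with a few lines of Ruby. This is a sketch under a loud assumption: the source encoding must be known in advance, and Windows-1252 is only a guess for a German Windows system. The method name `convert_to_utf8` is illustrative.

```ruby
# Re-encode a text file to UTF-8.
# ASSUMPTION: the file's current encoding is known. Windows-1252 is used as
# the default here purely as an example; pass the real one if it differs.
def convert_to_utf8(src_path, dest_path, source_encoding = 'Windows-1252')
  raw = File.binread(src_path)              # raw bytes, no decoding applied
  text = raw.force_encoding(source_encoding) # label the bytes, no conversion yet
  # encode raises Encoding::UndefinedConversionError if a byte has no mapping
  # in the assumed source encoding.
  utf8 = text.encode(Encoding::UTF_8)
  File.binwrite(dest_path, utf8)             # write bytes as-is, line endings untouched
  utf8
end
```

Running this on a file that is already UTF-8 would double-encode the non-ASCII characters, so it is worth checking the input first.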

Once your document is encoded in UTF-8 (or it already is), then we need to get into the business of figuring out what settings we need to document so that Ruby is reading and writing the file as UTF-8, even if the system is not set to a UTF-8 locale. If we can get this properly documented, we should be able to confidently handle these types of problems in the future.

To close off this reply, I want to emphasize again that there is no code inside of Asciidoctor that would affect the processing of these characters. Asciidoctor assumes it's reading UTF-8 source and it writes UTF-8 output.

One way or another, we'll get this sorted out for sure!

-Dan



Re: invalid byte sequence in UTF-8 (ArgumentError)

Chris
In reply to this post by LightGuardjp
The links are in my second post.

Ruby v1.9.3p545 is installed on my Windows 7 64 bit (german) system.

1. "sublime_text.adoc" was created with "Sublime Text 3"
2. "windows_notepad.adoc" was created with Windows 7 Editor (Notepad)

Notepad is on every Windows system, and Sublime Text 3 is a widely used editor on Windows, Mac, and Linux.

Is this an editor-related error?
Do you know of an editor for Windows that works with Asciidoctor without errors?

Re: invalid byte sequence in UTF-8 (ArgumentError)

mojavelinux
Administrator
In reply to this post by LightGuardjp
Chris and Jason,

The problem isn't necessarily the files themselves, it's the file + the system it's being processed on. Jason, if you downloaded this file and tried it, you won't get the same results as Chris because you are not running the German version of Windows (or Ruby on Windows for that matter).

However, what we do want to establish first and foremost is that the file is being saved with UTF-8 encoding. That is a prerequisite for Asciidoctor. Because it's hard enough just getting that right, we don't want to get into the business of mixing encodings; there's only pain and suffering down that path.

Once we have the document in UTF-8, then we need to make sure that Ruby is reading and writing it as UTF-8. It should be, but we need to know more about your system.

Can you run:

 $ asciidoctor -v

and print the results.

I just thought of something. I should print the system charset in the asciidoctor -v output. That will be very helpful when debugging things. I'll add that feature to master.

-Dan



Re: invalid byte sequence in UTF-8 (ArgumentError)

Chris
Hi Dan,

asciidoctor -v:
Asciidoctor 1.5.0 [http://asciidoctor.org]
Runtime Environment (ruby 1.9.3p545 (2014-02-24) [i386-mingw32])

Re: invalid byte sequence in UTF-8 (ArgumentError)

mojavelinux
Administrator
In reply to this post by Chris
Chris,

As I mention in the AsciiDoc Writer's Guide, I strongly recommend against using Notepad. It is a seriously broken program on so many levels. It's true that it's on every Windows system, but gum is on every sidewalk, doesn't mean you should chew it...if you know what I'm saying :)

Other Windows users typically recommend Notepad++. However, I find it to be a cluttered user interface. These days, I'm recommending the Atom editor developed by GitHub and community. Atom is cross platform, it uses modern web technologies under the hood and it even has an AsciiDoc preview plugin based on Asciidoctor!


You won't be disappointed. The one catch is that you have to build it on Windows to install it, but there are instructions that hopefully make that reasonably straightforward.

One more thing: could you run the following command and paste the output?

 $ ruby -e 'puts [Encoding.default_external,Encoding.default_internal,"".encoding,__ENCODING__] * ","'

If you get an error, try this one:

$ ruby -e 'puts [Encoding.default_external,Encoding.default_internal,"".encoding] * ","'
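
The one-liners above print the values without labels. As a sketch, the same information can be printed with names attached, along with the locale and filesystem encodings Ruby detected; the method name `encoding_report` is made up for illustration.

```ruby
# Collect the encodings Ruby is currently working with.
def encoding_report
  {
    'default_external' => Encoding.default_external,
    'default_internal' => Encoding.default_internal, # nil unless explicitly set
    'locale'           => Encoding.find('locale'),
    'filesystem'       => Encoding.find('filesystem'),
    'source'           => __ENCODING__
  }
end

encoding_report.each do |label, enc|
  puts format('%-17s %s', label, enc.nil? ? '(not set)' : enc.name)
end
```

An empty field in the one-liner output corresponds to `(not set)` here, which is the usual state of `default_internal`.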

-Dan




Re: invalid byte sequence in UTF-8 (ArgumentError)

mojavelinux
Administrator
In reply to this post by Chris
If Atom doesn't work for you, try Brackets. It's very similar to Atom, and also has AsciiDoc support, except it also has a Windows MSI installer :)


-Dan



Re: invalid byte sequence in UTF-8 (ArgumentError)

mojavelinux
Administrator
In reply to this post by Chris

On Sun, Aug 24, 2014 at 1:05 AM, Chris [via Asciidoctor :: Discussion] <[hidden email]> wrote:
ruby 1.9.3p545

This could be part of the encoding problem. Ruby 1.9.3 got encoding all kinds of wrong (in some ways worse than Ruby 1.8.7). Thankfully, they got it right in Ruby 2!

I strongly recommend Ruby 2.0.0 (see http://rubyinstaller.org/). Now that the Nokogiri gem runs on Ruby 2.0.0 on Windows, you shouldn't have any problem with using this version of Ruby.


Re: invalid byte sequence in UTF-8 (ArgumentError)

mojavelinux
Administrator
In reply to this post by Chris
Keep in mind that Asciidoctor supports Ruby 1.8.7 and above with no problem. It's just that Ruby 1.9.3 brings its own baggage...and Asciidoctor inherits that. Ruby 1.9.3 works just fine when everything is UTF-8, including the system. When it's not, things go sideways. It was because of these problems that the Ruby developers learned about encoding and fixed their ways in Ruby 2.

-Dan



Re: invalid byte sequence in UTF-8 (ArgumentError)

Chris
ruby -e 'puts [Encoding.default_external,Encoding.default_internal,"".encoding,__ENCODING__] * ","' :

CP850,,CP850,CP850

Re: invalid byte sequence in UTF-8 (ArgumentError)

mojavelinux
Administrator

On Sun, Aug 24, 2014 at 1:18 AM, Chris [via Asciidoctor :: Discussion] <[hidden email]> wrote:
ruby -e 'puts [Encoding.default_external,Encoding.default_internal,"".encoding,__ENCODING__] * ","' :

CP850,,CP850,CP850

Just as I had suspected. Grrr, Windows.

At least now I know that I need to install a non-English version of Windows to get an environment that has a non-UTF-8 setup to test.

I'm not sure what the right argument is to change that, at least in the Ruby environment, but it may be either the `-E` flag for Ruby or adding the line:

 Encoding.default_external = "UTF-8"

to the top of the asciidoctor.rb script. I had considered doing this at some point, but it's not recommended that gems change this setting. Perhaps I can put it in the asciidoctor command, though.
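
To illustrate what the setting changes, the sketch below reads the same UTF-8 file under two different values of `Encoding.default_external`. It assumes `Encoding.default_internal` is not set, and the helper name `read_under_default_external` is made up; it says nothing about how Asciidoctor itself reads files.

```ruby
# Read a file the way a plain File.read does under a given default external
# encoding, then restore the previous setting.
# ASSUMPTION: Encoding.default_internal is nil, so no transcoding happens.
def read_under_default_external(path, encoding_name)
  saved = Encoding.default_external
  Encoding.default_external = encoding_name
  File.read(path)
ensure
  Encoding.default_external = saved
end

# Demonstration when run with a file argument: the bytes on disk are the same
# in both reads; only the label on the resulting string (and therefore the
# character count) changes.
if __FILE__ == $PROGRAM_NAME && ARGV[0]
  %w[UTF-8 CP850].each do |name|
    text = read_under_default_external(ARGV[0], name)
    puts "#{name}: tagged #{text.encoding}, #{text.length} characters, #{text.bytesize} bytes"
  end
end
```

The `-E` flag mentioned above sets the same default from the command line, for example `ruby -E UTF-8 some_script.rb` (script name hypothetical). Whether that alone is enough on a given Windows setup is not confirmed in this thread.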

-Dan

Re: invalid byte sequence in UTF-8 (ArgumentError)

Chris
I followed your recommendation and updated to Ruby v2.0.0p481 via the Windows installer (http://rubyinstaller.org).

After that the Sublime Text 3 Asciidoctor file (sublime_text.adoc) compiles without an error!

The windows_notepad.adoc example file got the same error message as before but I don't care because I don't use that crappy editor anyway with Asciidoctor.

Many thanks for your help Dan!

Re: invalid byte sequence in UTF-8 (ArgumentError)

Chris
The only small error which is left is shown in my Firefox browser. The Asciidoctor example file compiled to html shows:

Test äöü

Last updated 2014-08-24 08:36:42 Mitteleuropõische Sommerzeit

In the last line it should say "Mitteleuropäische Sommerzeit", not "Mitteleuropõische". Don't know what kind of issue that is.

Re: invalid byte sequence in UTF-8 (ArgumentError)

mojavelinux
Administrator
In reply to this post by Chris
\o/

I'll be sure to do some testing on a German version of Windows so I'm well informed about the circumstances. What's important is that you can proceed!!

-Dan



Re: invalid byte sequence in UTF-8 (ArgumentError)

mojavelinux
Administrator
In reply to this post by Chris

On Sun, Aug 24, 2014 at 1:44 AM, Chris [via Asciidoctor :: Discussion] <[hidden email]> wrote:
Last updated 2014-08-24 08:36:42 Mitteleuropõische Sommerzeit

That looks like a bug in Ruby because that's supposed to be printing the timezone. It's read from the following line of Ruby:

::Time.now.strftime('%H:%M:%S %Z')

If you get the same result when running:

 $ ruby -e "puts ::Time.now.strftime('%H:%M:%S %Z')"

that's on Ruby.

You can disable that text by passing `-a nofooter` to the `asciidoctor` command.
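
One plausible explanation for "õ" appearing in place of "ä", not confirmed in this thread, is that the timezone name arrives as Windows-1252 bytes but is interpreted as CP850. The byte 0xE4 is "ä" in the first code page and "õ" in the second. The sketch below shows the difference and how such a string could be repaired; `repair_zone_name` is a hypothetical helper.

```ruby
# The byte 0xE4 decodes to different characters depending on the code page.
byte = "\xE4".b

as_cp1252 = byte.dup.force_encoding('Windows-1252').encode('UTF-8')
as_cp850  = byte.dup.force_encoding('CP850').encode('UTF-8')

puts "0xE4 read as Windows-1252: #{as_cp1252}"
puts "0xE4 read as CP850:        #{as_cp850}"

# Reverse the mis-decoding: turn the text back into the bytes it came from,
# relabel those bytes with the code page assumed to be the real one, and
# convert to UTF-8.
# ASSUMPTION: both code page names are guesses for a German Windows system.
def repair_zone_name(garbled, wrong = 'CP850', actual = 'Windows-1252')
  garbled.encode(wrong).force_encoding(actual).encode('UTF-8')
end

puts repair_zone_name("Mitteleurop\u00F5ische Sommerzeit")
```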


Re: invalid byte sequence in UTF-8 (ArgumentError)

Chris
Don't know why but the following command

ruby -e "puts ::Time.now.strftime('%H:%M:%S %Z')"
::

outputs only two colons.

The switch asciidoctor -a nofooter works fine. Thanks!