Questions about readDocumentStructure()

vmassol
Hi guys,

I'm Vincent from the XWiki project, and I'm trying to add support for the AsciiDoc syntax in XWiki.

I tested with the following:

        Map<String,Object> parameters = new HashMap<>();
        parameters.put(Asciidoctor.STRUCTURE_MAX_LEVEL, 10);
        StructuredDocument document = this.asciidoctor.readDocumentStructure("This is *bold*", parameters);
        for (ContentPart part : document.getParts()) {
            System.out.println("part content: " + part.getContent());
        }

I have the following questions:

1) It prints "part content: This is <strong>bold</strong>". I don't understand why parsing the source returns HTML in the AST.
2) How can I get the sub-elements of a ContentPart, i.e. the individual words, the fact that the word "bold" is in bold, etc.?

Thanks a lot
-Vincent

Re: Questions about readDocumentStructure()

mojavelinux
Administrator
Vincent,

At the moment, there are two slightly competing AST models in AsciidoctorJ. For general-purpose tree traversal, I recommend the lower-level API, which you reach through the load API.

Let's assume we are reading the following AsciiDoc source:

[source,asciidoc]
----
= Document Title

preamble

== Section A

section content with *bold*

== Section B

section content with *italic*
----

Here's how you can load and walk (naively) the tree structure:

[source,java]
----
String source = "...";
Asciidoctor asciidoctor = JRubyAsciidoctor.create();
Map<String,Object> options = new HashMap<String,Object>();
Document document = asciidoctor.load(source, options);
for (AbstractBlock block : document.blocks()) {
    System.out.println(":" + block.context());
    for (AbstractBlock childBlock : block.blocks()) {
        if (childBlock.context().equals("paragraph")) {
            System.out.println("  :paragraph, lines: " + ((Block) childBlock).lines());
        }
        else {
            System.out.println("  :" + childBlock.context());
        }
    }
}
----

NOTE: I'm not sure why the AbstractBlock and Block interfaces are not aligned (hence the cast). That looks like a bug in the AST. The Block interface gives you access to the raw lines (Collection) and raw source (String).

When you run the code in the previous listing, you should see the following output:

....
:preamble
  :paragraph, lines: ["preamble"]
:section
  :paragraph, lines: ["section content with *bold*"]
:section
  :paragraph, lines: ["section content with *italic*"]
....
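
The listing above only descends two levels. As a rough sketch of a walk to arbitrary depth, here is the same idea written recursively. To keep it runnable without AsciidoctorJ on the classpath, it uses a stand-in `Node` class (hypothetical, not the AbstractBlock interface); with the real API you would recurse over `block.blocks()` in the same way.

```java
import java.util.ArrayList;
import java.util.List;

final class TreeWalk {
    /** Stand-in for a block node; NOT the AsciidoctorJ AbstractBlock interface. */
    static final class Node {
        final String context;
        final List<Node> children = new ArrayList<>();

        Node(String context) { this.context = context; }

        Node add(Node child) { children.add(child); return this; }
    }

    /** Depth-first walk: one line per node, indented two spaces per level. */
    static void walk(Node node, int depth, List<String> out) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            line.append("  ");
        }
        out.add(line.append(':').append(node.context).toString());
        for (Node child : node.children) {
            walk(child, depth + 1, out);
        }
    }

    /** Outlines everything below the root, as the two-level loop does. */
    static List<String> outline(Node root) {
        List<String> out = new ArrayList<>();
        for (Node child : root.children) {
            walk(child, 0, out);
        }
        return out;
    }

    /** Mirrors the shape of the sample document: a preamble and two sections. */
    static Node sample() {
        return new Node("document")
            .add(new Node("preamble").add(new Node("paragraph")))
            .add(new Node("section").add(new Node("paragraph")))
            .add(new Node("section").add(new Node("paragraph")));
    }

    public static void main(String[] args) {
        for (String line : outline(sample())) {
            System.out.println(line);
        }
    }
}
```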


The AST only provides access to block-level content (sections, paragraphs, figures, etc). You cannot access inline elements because they are not parsed into a tree in the underlying Ruby implementation. We do have plans to do it (see https://github.com/asciidoctor/asciidoctor/issues/61). As it stands, inline elements are converted as they are parsed (for example, see https://github.com/asciidoctor/asciidoctor/blob/master/lib/asciidoctor/substitutors.rb#L575). That's why the ContentPart currently gives you converted output (though it should provide lines() and source() methods too). 

Currently, not all of the AST methods in the Ruby API are mapped in AsciidoctorJ. For instance, the Ruby API provides access to the line number of a block, but that's not currently available via AsciidoctorJ. ContentPart should also provide access to the Block, where relevant. Let's make it happen.

To help us understand your use case, could you explain what XWiki needs to do with the AST? That way, we can focus on getting the features mapped or implemented that are needed.

I highly recommend that you look at the Asciidoctor Ruby API (http://rubydoc.info/gems/asciidoctor/Asciidoctor/AbstractBlock) to understand the "source of truth" beneath the AsciidoctorJ API.

Thanks for your input!

Cheers,

-Dan




Re: Questions about readDocumentStructure()

vmassol
Hi Dan,

Thanks a lot for the help.

See below.

On 21 Sep 2014 at 09:22:40, mojavelinux [via Asciidoctor :: Discussion] wrote:

> To help us understand your use case, could you explain what XWiki needs to do with the AST? That way, we can focus on getting the features mapped or implemented that are needed. 

The XWiki Rendering engine works like this (see http://rendering.xwiki.org for more details):
- There are Parsers for various syntaxes. The goal of a parser is to return an XDOM object which is the AST of the source. Note that we go down to the level of words, spaces and special symbols. For example “hello: Dan” will generate a ParagraphBlock with 4 children Blocks: a WordBlock with content “hello”, a SpecialSymbolBlock for ‘:’, a SpaceBlock for the space and another WordBlock for “Dan”.
- Then we have optional transformations that convert an XDOM into a modified XDOM'. This is how we handle macros, for example: the Parser generates a MacroBlock, and a MacroTransformation converts the MacroBlock into a List of Blocks by executing the content of the macro.
- Then we have Renderers for various output syntaxes that take an XDOM as input and generate the output for the syntax (HTML, PDF, etc).
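
To make the word-level granularity concrete, here is a minimal sketch that splits “hello: Dan” into the four blocks described above. The block names are plain strings modeled on that example, not XWiki Rendering's actual classes, and the splitting rule (letters/digits, space, anything else) is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class WordLevelTokens {
    /** Splits text into word, space, and special-symbol blocks (illustrative rule only). */
    static List<String> tokenize(String text) {
        List<String> blocks = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                // Consume the whole run of letters/digits as one word.
                int start = i;
                while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                    i++;
                }
                blocks.add("WordBlock(" + text.substring(start, i) + ")");
            } else if (c == ' ') {
                blocks.add("SpaceBlock");
                i++;
            } else {
                // Anything that is neither a word character nor a space.
                blocks.add("SpecialSymbolBlock(" + c + ")");
                i++;
            }
        }
        return blocks;
    }

    public static void main(String[] args) {
        for (String block : tokenize("hello: Dan")) {
            System.out.println(block);
        }
    }
}
```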

What’s important here is that I need to represent what Asciidoctor gives me as an XDOM. If I only get “section content with *bold*”, that won’t be good enough, since I’d then need to parse it on my side, which means understanding the full AsciiDoc syntax for every kind of element: bold, links, images, etc.

I can live without an AST at the level of words, but I need the parts of the source that contain AsciiDoc syntax (formatting, links, images, etc.) to be represented as Blocks.

For example, the following would be fine for “section content with *bold*”:

TextBlock(“section content with”)
BoldBlock()
  TextBlock(“bold”)
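
As a rough illustration of that shape, the sketch below turns a paragraph's raw text into text and bold entries. It is a toy: it only pairs up `*` characters and ignores the real AsciiDoc rules (word boundaries, escaping, nesting, other formatting), it keeps the trailing space of the leading text, and the class and block names are hypothetical rather than anything in Asciidoctor or XWiki.

```java
import java.util.ArrayList;
import java.util.List;

final class InlineBoldSketch {
    /** Splits text on *...* pairs into TextBlock and BoldBlock entries (toy rule). */
    static List<String> parse(String text) {
        List<String> blocks = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int open = text.indexOf('*', pos);
            int close = open < 0 ? -1 : text.indexOf('*', open + 1);
            if (close < 0) {
                // No complete *...* pair left: the rest is plain text.
                blocks.add("TextBlock(" + text.substring(pos) + ")");
                break;
            }
            if (open > pos) {
                blocks.add("TextBlock(" + text.substring(pos, open) + ")");
            }
            blocks.add("BoldBlock(TextBlock(" + text.substring(open + 1, close) + "))");
            pos = close + 1;
        }
        return blocks;
    }

    public static void main(String[] args) {
        for (String block : parse("section content with *bold*")) {
            System.out.println(block);
        }
    }
}
```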

Thanks!
-Vincent

> I highly recommend that you look at the Asciidoctor Ruby API {1} to understand the "source of truth" beneath the AsciidoctorJ API.
>
> Thanks for your input!
>
> Cheers,
>
> -Dan
>
> :1: http://rubydoc.info/gems/asciidoctor/Asciidoctor/AbstractBlock
>
> --
> Dan Allen | http://google.com/profiles/dan.j.allen

Re: Questions about readDocumentStructure()

mojavelinux
Administrator
Thanks Vincent. Your summary gives me a clear picture of what we need.

On Sun, Sep 21, 2014 at 9:21 AM, vmassol [via Asciidoctor :: Discussion] wrote:
> What’s important here is that I need to represent what Asciidoctor gives me as an XDOM. If I only get “section content with *bold*”, that won’t be good enough, since I’d then need to parse it on my side, which means understanding the full AsciiDoc syntax for every kind of element: bold, links, images, etc.
>
> I can live without an AST at the level of words, but I need the parts of the source that contain AsciiDoc syntax (formatting, links, images, etc.) to be represented as Blocks.
>
> For example, the following would be fine for “section content with *bold*”:
>
> TextBlock(“section content with”)
> BoldBlock()
>   TextBlock(“bold”)

I think the strategy to take here is to start developing an inline parser in Asciidoctor alongside the existing streaming transformer. Once it's fully fleshed out, we can switch to it. But the benefit is that you start to get something to use that at least hits the major syntax sooner rather than later. In other words, we can roll it out gradually. I envision the inline parser to be something you can call on a given block. Keep in mind that not all blocks in Asciidoctor have parsed text, or the text is parsed differently, so it makes sense that it's available as an API (at least in the near term) on the node.

Cheers,

-Dan


Re: Questions about readDocumentStructure()

vmassol
On 23 Sep 2014 at 22:24:06, mojavelinux [via Asciidoctor :: Discussion] wrote:

> I think the strategy to take here is to start developing an inline parser in Asciidoctor alongside the existing streaming transformer. Once it's fully fleshed out, we can switch to it. But the benefit is that you start to get something to use that at least hits the major syntax sooner rather than later. In other words, we can roll it out gradually. I envision the inline parser to be something you can call on a given block. Keep in mind that not all blocks in Asciidoctor have parsed text, or the text is parsed differently, so it makes sense that it's available as an API (at least in the near term) on the node. 

Makes sense.

Also note that XWiki Rendering supports streaming parsers/events. Is there a way in AsciidoctorJ to receive block events as they are parsed, instead of calling readDocumentStructure(), which puts everything in memory before returning and is thus less suited to large chunks of text?

Thanks
-Vincent



Re: Questions about readDocumentStructure()

mojavelinux
Administrator

On Tue, Sep 23, 2014 at 3:47 PM, vmassol [via Asciidoctor :: Discussion] wrote:
> Is there a way in AsciidoctorJ to receive block events as they are parsed, instead of calling readDocumentStructure(), which puts everything in memory before returning and is thus less suited to large chunks of text?

Not right now. We'd have to first add it to Asciidoctor core (Ruby) before we add it to AsciidoctorJ.
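
Purely to illustrate the kind of event-based interface being asked about, here is a sketch of a push-style scanner. Nothing in it exists in Asciidoctor or AsciidoctorJ: the `BlockListener` interface and the line-based scanning rule (leading `=` markers start a heading, blank lines end a paragraph) are assumptions, and it understands far less than the real parser. The point is only that the listener sees each block as soon as it is complete, and nothing beyond the current paragraph is held in memory.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class BlockEventSketch {
    /** Hypothetical listener; no such interface exists in AsciidoctorJ. */
    interface BlockListener {
        void onHeading(int markers, String title);
        void onParagraph(String text);
    }

    /** Reads line by line and fires events; only the current paragraph is buffered. */
    static void scan(BufferedReader reader, BlockListener listener) throws IOException {
        StringBuilder paragraph = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            int markers = headingMarkers(line);
            if (markers > 0 || line.trim().isEmpty()) {
                flush(paragraph, listener);
                if (markers > 0) {
                    listener.onHeading(markers, line.substring(markers + 1).trim());
                }
            } else {
                if (paragraph.length() > 0) {
                    paragraph.append(' ');
                }
                paragraph.append(line.trim());
            }
        }
        flush(paragraph, listener);
    }

    /** Number of leading '=' characters when followed by a space, else 0. */
    static int headingMarkers(String line) {
        int n = 0;
        while (n < line.length() && line.charAt(n) == '=') {
            n++;
        }
        return (n > 0 && n < line.length() && line.charAt(n) == ' ') ? n : 0;
    }

    /** Emits the buffered paragraph, if any, and clears the buffer. */
    static void flush(StringBuilder paragraph, BlockListener listener) {
        if (paragraph.length() > 0) {
            listener.onParagraph(paragraph.toString());
            paragraph.setLength(0);
        }
    }

    /** Convenience wrapper that records events as strings, in order. */
    static List<String> events(String source) {
        final List<String> seen = new ArrayList<>();
        try {
            scan(new BufferedReader(new StringReader(source)), new BlockListener() {
                @Override public void onHeading(int markers, String title) {
                    seen.add("heading(" + markers + ", " + title + ")");
                }
                @Override public void onParagraph(String text) {
                    seen.add("paragraph(" + text + ")");
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        String source = "= Document Title\n\npreamble\n\n== Section A\n\nsection content with *bold*\n";
        for (String event : events(source)) {
            System.out.println(event);
        }
    }
}
```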